A comparison of latent semantic analysis and correspondence analysis of document-term matrices
Latent semantic analysis (LSA) and correspondence analysis (CA) are two techniques that use a singular value decomposition for dimensionality reduction. LSA has been extensively used to obtain low-dimensional representations that capture relationships among documents and terms. In this article, we present a theoretical analysis and comparison of the two techniques in the context of document-term matrices. We show that CA has some attractive properties as compared to LSA, for instance that effects of margins, that is, sums of row elements and column elements, arising from differing document lengths and term frequencies are effectively eliminated so that the CA solution is optimally suited to focus on relationships among documents and terms. A unifying framework is proposed that includes both CA and LSA as special cases. We empirically compare CA to various LSA-based methods on text categorization in English and authorship attribution on historical Dutch texts and find that CA performs significantly better. We also apply CA to a long-standing question regarding the authorship of the Dutch national anthem Wilhelmus and provide further support that it can be attributed to the author Datheen, among several contenders.
Authorship attribution, Information retrieval, Singular value decomposition, Statistical methods, Text classification, Text data mining
Qi, Q.
f88a0e9c-6e23-40ce-ad86-7aba5680e947
Hessen, David J.
5e4ddabd-0df6-48e4-8c6e-478e2f1940ec
Deoskar, Tejaswini
f85c9557-f301-4991-a139-5aaf1b01597a
Van Der Heijden, Peter
85157917-3b33-4683-81be-713f987fd612
18 May 2023
Qi, Q., Hessen, David J., Deoskar, Tejaswini and Van Der Heijden, Peter (2023) A comparison of latent semantic analysis and correspondence analysis of document-term matrices. arXiv, 8 (10). (doi:10.1017/S1351324923000244).
Text: 2108.06197 - Accepted Manuscript
Text: Updated: A comparison of latent semantic analysis and correspondence analysis of document-term matrices - Accepted Manuscript
More information
Accepted/In Press date: 3 March 2022
e-pub ahead of print date: 18 May 2023
Published date: 18 May 2023
Additional Information:
Funding Information:
Author Qianqian Qi is supported by the China Scholarship Council.
Publisher Copyright:
© The Author(s), 2023. Published by Cambridge University Press.
Identifiers
Local EPrints ID: 455888
URI: http://eprints.soton.ac.uk/id/eprint/455888
ISSN: 2331-8422
PURE UUID: 831db218-42b9-42ee-bbb6-2de1c97b333b
Catalogue record
Date deposited: 07 Apr 2022 16:46
Last modified: 16 Apr 2024 04:02
Contributors
Author:
Q. Qi
Author:
David J. Hessen
Author:
Tejaswini Deoskar
Author:
Peter Van Der Heijden