A comparison of latent semantic analysis and correspondence analysis of document-term matrices
Latent semantic analysis (LSA) and correspondence analysis (CA) are two techniques that use a singular value decomposition (SVD) for dimensionality reduction. LSA has been extensively used to obtain low-dimensional and dense vectors that capture relationships among documents and terms. In this article, we present a theoretical analysis and comparison of the two techniques in the context of document-term matrices. We show that CA has some attractive properties compared with LSA, for instance that the effects of margins arising from differing document lengths and term frequencies are effectively eliminated, so that the CA solution is optimally suited to focus on relationships among documents and terms. A unifying framework is proposed that includes both CA and LSA as special cases. We empirically compare CA to various LSA-based methods on two tasks, a document classification task in English and an authorship attribution task on historical Dutch texts, and find that CA performs significantly better. We also apply CA to a long-standing question regarding the authorship of the Dutch national anthem Wilhelmus and provide further support that it can be attributed to the author Datheen, amongst several contenders.
Qi, Q.
Hessen, David J.
Deoskar, Tejaswini
Van Der Heijden, Peter
Qi, Q., Hessen, David J., Deoskar, Tejaswini and Van Der Heijden, Peter (2022) A comparison of latent semantic analysis and correspondence analysis of document-term matrices. arXiv. (In Press)
Text: 2108.06197 - Accepted Manuscript
Text: Updated: A comparison of latent semantic analysis and correspondence analysis of document-term matrices - Accepted Manuscript
More information
Accepted/In Press date: 3 March 2022
Identifiers
Local EPrints ID: 455888
URI: http://eprints.soton.ac.uk/id/eprint/455888
ISSN: 2331-8422
PURE UUID: 831db218-42b9-42ee-bbb6-2de1c97b333b
Catalogue record
Date deposited: 07 Apr 2022 16:46
Last modified: 30 Oct 2023 06:21
Contributors
Author: Q. Qi
Author: David J. Hessen
Author: Tejaswini Deoskar