University of Southampton Institutional Repository

A comparison of latent semantic analysis and correspondence analysis of document-term matrices

Latent semantic analysis (LSA) and correspondence analysis (CA) are two techniques that use a singular value decomposition for dimensionality reduction. LSA has been extensively used to obtain low-dimensional representations that capture relationships among documents and terms. In this article, we present a theoretical analysis and comparison of the two techniques in the context of document-term matrices. We show that CA has some attractive properties as compared to LSA, for instance that effects of margins, that is, sums of row elements and column elements, arising from differing document lengths and term frequencies are effectively eliminated so that the CA solution is optimally suited to focus on relationships among documents and terms. A unifying framework is proposed that includes both CA and LSA as special cases. We empirically compare CA to various LSA-based methods on text categorization in English and authorship attribution on historical Dutch texts and find that CA performs significantly better. We also apply CA to a long-standing question regarding the authorship of the Dutch national anthem Wilhelmus and provide further support that it can be attributed to the author Datheen, among several contenders.
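
To make the margin argument concrete, here is a minimal NumPy sketch. It assumes plain LSA is the truncated SVD of the raw count matrix and CA is the truncated SVD of the standardized residuals from the independence model; the weighted LSA variants the article also compares are not covered. The function names `lsa` and `ca` and the toy matrix are hypothetical and are not taken from the article.

```python
import numpy as np

def lsa(F, k):
    """Rank-k LSA sketch: truncated SVD of the raw document-term counts."""
    U, s, Vt = np.linalg.svd(F, full_matrices=False)
    # Document and term coordinates, each scaled by the singular values.
    return U[:, :k] * s[:k], Vt[:k].T * s[:k]

def ca(F, k):
    """Rank-k CA sketch: truncated SVD of the standardized residuals."""
    P = F / F.sum()                 # proportions
    r = P.sum(axis=1)               # row margins (relative document lengths)
    c = P.sum(axis=0)               # column margins (relative term frequencies)
    E = np.outer(r, c)              # expected proportions under independence
    S = (P - E) / np.sqrt(E)        # standardized residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    # Principal coordinates: divide out the square roots of the margins.
    docs = (U[:, :k] * s[:k]) / np.sqrt(r)[:, None]
    terms = (Vt[:k].T * s[:k]) / np.sqrt(c)[:, None]
    return docs, terms

# Hypothetical toy counts (rows: documents, columns: terms).
# Document 1 is document 0 repeated three times: same profile, different length.
F = np.array([[2., 1., 0., 0.],
              [6., 3., 0., 0.],
              [0., 0., 1., 2.],
              [0., 1., 3., 1.]])

lsa_docs, _ = lsa(F, 2)
ca_docs, _ = ca(F, 2)

# LSA coordinates scale with document length; CA coordinates do not.
length_matters_in_lsa = np.allclose(lsa_docs[1], 3 * lsa_docs[0])
same_point_in_ca = np.allclose(ca_docs[0], ca_docs[1])
```

In this toy case the two documents with identical term profiles land on the same CA point, whereas in LSA the longer document sits three times further from the origin. This illustrates the margin effect the abstract describes; it does not reproduce the article's experiments.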

Keywords: Authorship attribution, Information retrieval, Singular value decomposition, Statistical methods, Text classification, Text data mining

Qi, Q., Hessen, David J., Deoskar, Tejaswini and Van Der Heijden, Peter (2023) A comparison of latent semantic analysis and correspondence analysis of document-term matrices. arXiv, 8 (10). (doi:10.1017/S1351324923000244).

Record type: Article

Text: 2108.06197 - Accepted Manuscript
Text: Updated: A comparison of latent semantic analysis and correspondence analysis of document-term matrices - Accepted Manuscript

More information

Accepted/In Press date: 3 March 2022
e-pub ahead of print date: 18 May 2023
Published date: 18 May 2023
Additional Information: Funding Information: Author Qianqian Qi is supported by the China Scholarship Council. Publisher Copyright: © The Author(s), 2023. Published by Cambridge University Press.

Identifiers

Local EPrints ID: 455888
URI: http://eprints.soton.ac.uk/id/eprint/455888
ISSN: 2331-8422
PURE UUID: 831db218-42b9-42ee-bbb6-2de1c97b333b
ORCID for Peter Van Der Heijden: orcid.org/0000-0002-3345-096X

Catalogue record

Date deposited: 07 Apr 2022 16:46
Last modified: 16 Apr 2024 04:02

Contributors

Author: Q. Qi
Author: David J. Hessen
Author: Tejaswini Deoskar
Author: Peter Van Der Heijden
