The University of Southampton
University of Southampton Institutional Repository

Improving information retrieval through correspondence analysis instead of latent semantic analysis

Qi, Qianqian
a18a747f-c35a-4d27-b26d-1a064048dbc9
Hessen, David J.
f542c457-7a9c-412e-b5df-840b6fd0acd0
Van Der Heijden, Peter
85157917-3b33-4683-81be-713f987fd612

Qi, Qianqian, Hessen, David J. and Van Der Heijden, Peter (2023) Improving information retrieval through correspondence analysis instead of latent semantic analysis. Journal of Intelligent Information Systems. (doi:10.1007/s10844-023-00815-y).

Record type: Article

Abstract

The initial dimensions extracted by latent semantic analysis (LSA) of a document-term matrix have been shown to mainly display marginal effects, which are irrelevant for information retrieval. To improve the performance of LSA, usually the elements of the raw document-term matrix are weighted and the weighting exponent of singular values can be adjusted. An alternative information retrieval technique that ignores the marginal effects is correspondence analysis (CA). In this paper, the information retrieval performance of LSA and CA is empirically compared. Moreover, it is explored whether the two weightings also improve the performance of CA. The results for four empirical datasets show that CA always performs better than LSA. Weighting the elements of the raw data matrix can improve CA; however, it is data dependent and the improvement is small. Adjusting the singular value weighting exponent often improves the performance of CA; however, the extent of the improvement depends on the dataset and the number of dimensions.
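
The abstract's statement that CA ignores marginal effects can be illustrated with a minimal sketch. It assumes the standard textbook formulation of CA, in which the matrix that is decomposed by singular value decomposition holds the standardized residuals from the independence model; it is not taken from the paper's own code. The function name `standardized_residuals` and both toy document-term matrices are hypothetical.

```python
from math import sqrt


def standardized_residuals(counts):
    """Return the matrix CA decomposes: (p_ij - r_i * c_j) / sqrt(r_i * c_j).

    Assumes every row and column of `counts` has a non-zero sum.
    """
    total = sum(sum(row) for row in counts)
    # Correspondence matrix P: counts as proportions of the grand total.
    p = [[x / total for x in row] for row in counts]
    r = [sum(row) for row in p]        # row masses (document margins)
    c = [sum(col) for col in zip(*p)]  # column masses (term margins)
    return [
        [(p[i][j] - r[i] * c[j]) / sqrt(r[i] * c[j]) for j in range(len(c))]
        for i in range(len(r))
    ]


# Hypothetical matrix whose rows are proportional: the documents differ only
# in length, so the matrix contains nothing but marginal effects.
margins_only = [[2, 4, 6], [1, 2, 3], [4, 8, 12]]

# Hypothetical matrix with two groups of documents using different terms.
two_topics = [[5, 4, 0, 0], [4, 5, 1, 0], [0, 1, 5, 4], [0, 0, 4, 5]]

s_margins = standardized_residuals(margins_only)
s_topics = standardized_residuals(two_topics)

# Largest absolute residual: essentially zero when only margins are present.
print(max(abs(x) for row in s_margins for x in row))
# Residual signs separate the two term groups for the first document.
print(s_topics[0])
```

For the proportional matrix every residual vanishes up to floating-point error, so nothing is left for the leading dimensions to pick up. LSA, by contrast, decomposes the raw or weighted counts directly, which is why its initial dimensions can be dominated by document length and term frequency.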

Text
Qi et al. (2023) Journal_of_Intelligent_Information_Systems_Improving_information_retrieval_through_correspondence_analysis_instead_of_latent_semantic_analysis (1) - Accepted Manuscript
Available under License Creative Commons Attribution.
Text
s10844-023-00815-y - Version of Record
Available under License Creative Commons Attribution.

More information

Accepted/In Press date: 25 August 2023
e-pub ahead of print date: 9 September 2023
Published date: 9 September 2023
Additional Information: Funding Information: Author Qianqian Qi is supported by the China Scholarship Council (CSC202007720017). Publisher Copyright: © 2023, The Author(s).
Keywords: Information retrieval, Initial dimensions, Singular value decomposition, Singular value weighting exponent

Identifiers

Local EPrints ID: 482278
URI: http://eprints.soton.ac.uk/id/eprint/482278
PURE UUID: a5975228-d185-4a25-94f9-a0348efeebba
ORCID for Peter Van Der Heijden: orcid.org/0000-0002-3345-096X

Catalogue record

Date deposited: 25 Sep 2023 16:43
Last modified: 18 Mar 2024 03:25

Contributors

Author: Qianqian Qi
Author: David J. Hessen
Author: Peter Van Der Heijden