Improving information retrieval through correspondence analysis instead of latent semantic analysis
The initial dimensions extracted by latent semantic analysis (LSA) of a document-term matrix have been shown to mainly display marginal effects, which are irrelevant for information retrieval. To improve the performance of LSA, the elements of the raw document-term matrix are usually weighted, and the weighting exponent of the singular values can be adjusted. An alternative information retrieval technique that ignores the marginal effects is correspondence analysis (CA). This paper empirically compares the information retrieval performance of LSA and CA and explores whether the two weightings also improve the performance of CA. The results for four empirical datasets show that CA always performs better than LSA. Weighting the elements of the raw data matrix can improve CA; however, the improvement is data dependent and small. Adjusting the singular value weighting exponent often improves the performance of CA; however, the extent of the improvement depends on the dataset and the number of dimensions.
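The contrast drawn in the abstract can be illustrated with a minimal NumPy sketch. It assumes the standard textbook forms of both methods: LSA as a truncated SVD of the document-term matrix, and CA as an SVD of the standardized residuals from the independence model, which removes the row and column margins. The function names, the parameter `alpha` (standing in for the singular value weighting exponent), and the toy matrix `F` are hypothetical and are not taken from the paper, whose exact implementation may differ.

```python
import numpy as np


def lsa_doc_coords(F, k, alpha=1.0):
    """Document coordinates from a rank-k SVD of F, scaled by s**alpha."""
    U, s, _ = np.linalg.svd(F, full_matrices=False)
    return U[:, :k] * s[:k] ** alpha


def ca_doc_coords(F, k, alpha=1.0):
    """Document coordinates from CA of F (assumes no empty rows or columns)."""
    P = F / F.sum()                 # correspondence matrix
    r = P.sum(axis=1)               # row masses (document margins)
    c = P.sum(axis=0)               # column masses (term margins)
    expected = np.outer(r, c)       # independence model built from the margins
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, s, _ = np.linalg.svd(S, full_matrices=False)
    return (U[:, :k] / np.sqrt(r)[:, None]) * s[:k] ** alpha


# Toy document-term counts (assumed data). Document 0 is document 1
# repeated twice, so both have the same term profile but different lengths.
F = np.array([
    [4, 2, 0, 0],
    [2, 1, 0, 0],
    [0, 0, 3, 1],
    [0, 1, 2, 2],
], dtype=float)

lsa = lsa_doc_coords(F, k=2)
ca = ca_doc_coords(F, k=2)

print("LSA coordinates:\n", lsa)
print("CA coordinates:\n", ca)
```

In this toy example the two documents with identical term profiles receive identical CA coordinates, whereas their LSA coordinates differ by the ratio of their row totals, which is one way to see the marginal effect of document length. Setting `alpha` to a value other than 1 rescales the retained dimensions relative to each other.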
Information retrieval, Initial dimensions, Singular value decomposition, Singular value weighting exponent
Qi, Qianqian
Hessen, David J.
Van Der Heijden, Peter
9 September 2023
Qi, Qianqian, Hessen, David J. and Van Der Heijden, Peter (2023) Improving information retrieval through correspondence analysis instead of latent semantic analysis. Journal of Intelligent Information Systems. (doi:10.1007/s10844-023-00815-y).
More information
Accepted/In Press date: 25 August 2023
e-pub ahead of print date: 9 September 2023
Published date: 9 September 2023
Additional Information:
Funding Information:
Author Qianqian Qi is supported by the China Scholarship Council (CSC202007720017).
Publisher Copyright:
© 2023, The Author(s).
Identifiers
Local EPrints ID: 482278
URI: http://eprints.soton.ac.uk/id/eprint/482278
PURE UUID: a5975228-d185-4a25-94f9-a0348efeebba
Catalogue record
Date deposited: 25 Sep 2023 16:43
Last modified: 11 Nov 2023 02:46