The University of Southampton
University of Southampton Institutional Repository

A comparison of correspondence analysis with PMI-based word embedding methods

Qi, Qianqian
Hessen, David J.
van der Heijden, Peter G.M.

Record type: UNSPECIFIED

Abstract

Popular word embedding methods such as GloVe and Word2Vec are related to the factorization of the pointwise mutual information (PMI) matrix. In this paper, we link correspondence analysis (CA) to the factorization of the PMI matrix. CA is a dimensionality reduction method that uses singular value decomposition (SVD), and we show that CA is mathematically close to the weighted factorization of the PMI matrix. In addition, we present two variants of CA that turn out to be successful in the factorization of the word-context matrix: CA applied to a matrix whose entries undergo a square-root transformation (ROOT-CA) and CA applied to a matrix whose entries undergo a root-root transformation (ROOTROOT-CA). An empirical comparison among CA- and PMI-based methods shows that the overall results of ROOT-CA and ROOTROOT-CA are slightly better than those of the PMI-based methods.
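
The quantities named in the abstract can be illustrated with a minimal NumPy sketch. It is not taken from the paper: the function names are hypothetical, the use of row principal coordinates as word vectors is one standard CA choice rather than the authors' confirmed one, and "root-root" is read here as the square root applied twice (a fourth root). The sketch assumes a non-negative word-context count matrix in which every row and every column has at least one positive count. CA takes the SVD of the standardized residuals (p_ij - r_i c_j) / sqrt(r_i c_j), which can be written as sqrt(r_i c_j) (p_ij / (r_i c_j) - 1); since log(x) is approximately x - 1 for x near 1, this is one common way to see why CA resembles a weighted factorization of the PMI matrix log(p_ij / (r_i c_j)).

```python
import numpy as np


def pmi_matrix(counts):
    """PMI log(p_ij / (r_i * c_j)); cells with zero count become -inf."""
    p = counts / counts.sum()
    r = p.sum(axis=1, keepdims=True)  # row (word) marginals
    c = p.sum(axis=0, keepdims=True)  # column (context) marginals
    with np.errstate(divide="ignore"):
        return np.log(p / (r * c))


def ca_embeddings(counts, k):
    """Row principal coordinates from CA, keeping the first k dimensions."""
    p = counts / counts.sum()
    r = p.sum(axis=1)
    c = p.sum(axis=0)
    expected = np.outer(r, c)
    # Standardized residuals: the matrix whose SVD defines CA.
    s = (p - expected) / np.sqrt(expected)
    u, sv, _ = np.linalg.svd(s, full_matrices=False)
    # Scale left singular vectors by singular values and by r^(-1/2).
    return (u[:, :k] * sv[:k]) / np.sqrt(r)[:, None]


def root_ca_embeddings(counts, k):
    """ROOT-CA (assumed form): CA on the square-rooted counts."""
    return ca_embeddings(np.sqrt(counts), k)


def rootroot_ca_embeddings(counts, k):
    """ROOTROOT-CA (assumed form): CA on the fourth root of the counts."""
    return ca_embeddings(np.sqrt(np.sqrt(counts)), k)


# Toy 3-word by 3-context count matrix, invented for illustration only.
toy_counts = np.array([[10.0, 2.0, 0.0],
                       [3.0, 8.0, 1.0],
                       [0.0, 1.0, 6.0]])
toy_pmi = pmi_matrix(toy_counts)
toy_ca = ca_embeddings(toy_counts, 2)
toy_root = root_ca_embeddings(toy_counts, 2)
toy_rootroot = rootroot_ca_embeddings(toy_counts, 2)
```

Unlike the PMI matrix, the CA residuals stay finite when a count is zero, so no smoothing or clipping is needed before the SVD in this sketch.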

Text
2405.20895v1 - Author's Original
Available under License Other.

More information

e-pub ahead of print date: 31 May 2024
Published date: 31 May 2024
Keywords: cs.CL

Identifiers

Local EPrints ID: 491179
URI: http://eprints.soton.ac.uk/id/eprint/491179
PURE UUID: 9e48958f-8041-4c6a-a54f-637c0306230a
ORCID for Peter G.M. van der Heijden: orcid.org/0000-0002-3345-096X

Catalogue record

Date deposited: 14 Jun 2024 16:40
Last modified: 16 Jul 2024 01:44

Contributors

Author: Qianqian Qi
Author: David J. Hessen
Author: Peter G.M. van der Heijden
