
Correspondence Analysis and PMI-Based Word Embeddings: A Comparative Study

About

Popular word embedding methods such as GloVe and Word2Vec are related to the factorization of the pointwise mutual information (PMI) matrix. In this paper, we establish a formal connection between correspondence analysis (CA) and PMI-based word embedding methods. CA is a dimensionality reduction method that uses singular value decomposition (SVD), and we show that CA is mathematically close to the weighted factorization of the PMI matrix. We further introduce variants of CA for word-context matrices, namely CA applied after a square-root transformation (ROOT-CA) and after a fourth-root transformation (ROOTROOT-CA). We analyze the performance of these methods and examine how their success or failure is influenced by extreme values in the decomposed matrix. Although our primary focus is on traditional static word embedding methods, we also include a comparison with a transformer-based encoder (BERT) to situate the results relative to contextual embeddings. Empirical evaluations across multiple corpora and word-similarity benchmarks show that ROOT-CA and ROOTROOT-CA perform slightly better overall than standard PMI-based methods and achieve results competitive with BERT.

Qianqian Qi, Ayoub Bagheri, David J. Hessen, Peter G. M. van der Heijden • 2024
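The CA procedure described in the abstract can be sketched in a few lines of NumPy: given a word-context count matrix, CA computes the SVD of the standardized residuals of the correspondence matrix, and the ROOT-CA / ROOTROOT-CA variants simply apply an elementwise root transformation to the counts first. The toy count matrix below is hypothetical, for illustration only; this is a minimal sketch of standard CA, not the authors' exact implementation.

```python
import numpy as np

# Hypothetical toy word-context co-occurrence counts
# (rows: words, columns: context words).
N = np.array([
    [8.0, 2.0, 1.0, 0.0],
    [3.0, 7.0, 2.0, 1.0],
    [0.0, 1.0, 6.0, 5.0],
    [1.0, 0.0, 4.0, 9.0],
])

def ca_embeddings(counts, k=2, transform=None):
    """Correspondence analysis of a count matrix.

    An optional elementwise transform (np.sqrt for ROOT-CA,
    x ** 0.25 for ROOTROOT-CA) is applied to the counts first.
    Returns the k-dimensional row (word) principal coordinates.
    """
    if transform is not None:
        counts = transform(counts)
    P = counts / counts.sum()        # correspondence matrix
    r = P.sum(axis=1)                # row masses
    c = P.sum(axis=0)                # column masses
    # Standardized residuals: D_r^{-1/2} (P - r c^T) D_c^{-1/2}
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    # Row principal coordinates, truncated to k dimensions
    return (U[:, :k] * sigma[:k]) / np.sqrt(r)[:, None]

emb_ca = ca_embeddings(N, k=2)                                       # plain CA
emb_root = ca_embeddings(N, k=2, transform=np.sqrt)                  # ROOT-CA
emb_rootroot = ca_embeddings(N, k=2, transform=lambda x: x ** 0.25)  # ROOTROOT-CA
print(emb_ca.shape)  # (4, 2)
```

Word similarity can then be scored by cosine similarity between rows of the embedding matrix, which is how the benchmarks below are typically evaluated.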

Related benchmarks

Task             Dataset                                           Metric                 Result   Rank
Word Similarity  WordSim-353                                       Spearman rho           0.7      114
Word Similarity  MEN                                               Spearman rho           0.4      68
Word Similarity  SimLex-999                                        Spearman correlation   44.5     31
Word Similarity  Mechanical Turk-771                               Spearman rho           0.669    8
Word Similarity  WordSim353 + MEN + Turk + SimLex-999 (aggregate)  Sum of Spearman rho    2.172    6
Word Similarity  WordSim353 (Wiki052024 corpus)                    --                     --       4
Word Similarity  MEN (Wiki052024 corpus)                           --                     --       4
Word Similarity  Turk (Wiki052024 corpus)                          --                     --       4
Word Similarity  SimLex-999 (Wiki052024 corpus)                    --                     --       4
