Correspondence Analysis and PMI-Based Word Embeddings: A Comparative Study
About
Popular word embedding methods such as GloVe and Word2Vec are related to the factorization of the pointwise mutual information (PMI) matrix. In this paper, we establish a formal connection between correspondence analysis (CA) and PMI-based word embedding methods. CA is a dimensionality reduction method based on the singular value decomposition (SVD), and we show that CA is mathematically close to a weighted factorization of the PMI matrix. We further introduce variants of CA for word-context matrices, namely CA applied after a square-root transformation (ROOT-CA) and after a fourth-root transformation (ROOTROOT-CA). We analyze the performance of these methods and examine how their success or failure is influenced by extreme values in the decomposed matrix. Although our primary focus is on traditional static word embedding methods, we also include a comparison with a transformer-based encoder (BERT) to situate the results relative to contextual embeddings. Empirical evaluations across multiple corpora and word-similarity benchmarks show that ROOT-CA and ROOTROOT-CA perform slightly better overall than standard PMI-based methods and achieve results competitive with BERT.
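The ROOT-CA idea described above can be sketched as standard correspondence analysis applied to an elementwise square-root-transformed word-context count matrix. The following is a minimal illustration of that pipeline, not the paper's exact implementation; the function name, toy counts, and default dimensionality are assumptions for the example.

```python
import numpy as np

def ca_embeddings(counts, dim=2, transform=np.sqrt):
    """Correspondence analysis of a word-context count matrix.

    ROOT-CA applies an elementwise square root to the counts before CA;
    pass transform=lambda x: x ** 0.25 for ROOTROOT-CA, or transform=None
    for plain CA. The CA steps themselves are the textbook SVD recipe.
    """
    N = transform(counts) if transform is not None else counts.astype(float)
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses (words)
    c = P.sum(axis=0)                    # column masses (contexts)
    # Standardized residuals: D_r^{-1/2} (P - r c^T) D_c^{-1/2}
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    # Row (word) principal coordinates: D_r^{-1/2} U diag(sigma)
    return (U[:, :dim] * sigma[:dim]) / np.sqrt(r)[:, None]

# Toy word-context counts (rows = words, columns = contexts; hypothetical)
counts = np.array([[10.0, 2.0, 0.0],
                   [8.0, 3.0, 1.0],
                   [0.0, 1.0, 9.0]])
emb = ca_embeddings(counts, dim=2)
print(emb.shape)  # (3, 2)
```

The square-root (or fourth-root) transform dampens the extreme co-occurrence counts that, as noted above, influence the success or failure of the decomposition.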
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Word Similarity | WordSim-353 | Spearman ρ 0.7 | 114 |
| Word Similarity | MEN | Spearman ρ 0.4 | 68 |
| Word Similarity | SimLex-999 | Spearman ρ 44.5 | 31 |
| Word Similarity | Mechanical Turk-771 | Spearman ρ 0.669 | 8 |
| Word Similarity | Aggregate of WordSim-353, MEN, Mechanical Turk-771, SimLex-999 | Sum of Spearman ρ 2.172 | 6 |
| Word Similarity | WordSim-353 (Wiki052024 corpus) | -- | 4 |
| Word Similarity | MEN (Wiki052024 corpus) | -- | 4 |
| Word Similarity | Mechanical Turk-771 (Wiki052024 corpus) | -- | 4 |
| Word Similarity | SimLex-999 (Wiki052024 corpus) | -- | 4 |
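The Spearman ρ figures reported in the table come from rank-correlating model cosine similarities with human similarity judgments over a benchmark's word pairs. A minimal sketch of that evaluation, with hypothetical embeddings and scores (not the paper's data), looks like this:

```python
import numpy as np
from scipy.stats import spearmanr

def eval_word_similarity(emb, pairs, human_scores):
    """Spearman rho between cosine similarities of embedding pairs and
    human judgments -- the metric behind the benchmark table.
    `emb` maps word -> vector; `pairs` is a list of (w1, w2) tuples.
    Names and data here are illustrative, not the paper's pipeline."""
    cos = []
    for w1, w2 in pairs:
        v1, v2 = emb[w1], emb[w2]
        cos.append(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
    rho, _ = spearmanr(cos, human_scores)
    return rho

# Tiny hypothetical benchmark: three pairs with human scores
emb = {"cat": np.array([1.0, 0.1]),
       "dog": np.array([0.9, 0.2]),
       "car": np.array([0.0, 1.0])}
pairs = [("cat", "dog"), ("cat", "car"), ("dog", "car")]
human = [9.0, 1.5, 2.0]
print(eval_word_similarity(emb, pairs, human))  # prints 1.0 (ranks agree perfectly)
```

The "Sum of Spearman ρ" aggregate row is simply the per-dataset ρ values added together, which is why its value (2.172) can exceed 1.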