Correspondence Analysis and PMI-Based Word Embeddings: A Comparative Study
About
Popular word embedding methods such as GloVe and Word2Vec are related to the factorization of the pointwise mutual information (PMI) matrix. In this paper, we establish a formal connection between correspondence analysis (CA) and PMI-based word embedding methods. CA is a dimensionality reduction method based on the singular value decomposition (SVD), and we show that CA is mathematically close to a weighted factorization of the PMI matrix. We further introduce variants of CA for word-context matrices, namely CA applied after a square-root transformation (ROOT-CA) and after a fourth-root transformation (ROOTROOT-CA). We analyze the performance of these methods and examine how their success or failure is influenced by extreme values in the decomposed matrix. Although our primary focus is on traditional static word embedding methods, we also include a comparison with a transformer-based encoder (BERT) to situate the results relative to contextual embeddings. Empirical evaluations across multiple corpora and word-similarity benchmarks show that ROOT-CA and ROOTROOT-CA perform slightly better overall than standard PMI-based methods and achieve results competitive with BERT.
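The ROOT-CA idea described above can be sketched as standard correspondence analysis applied to an elementwise square-root-transformed word-context count matrix. The following is a minimal illustration of that pipeline, not the paper's exact implementation; the function name, toy counts, and default dimensionality are assumptions for the example.

```python
import numpy as np

def ca_embeddings(counts, dim=2, transform=np.sqrt):
    """Correspondence analysis of a word-context count matrix.

    ROOT-CA applies an elementwise square root to the counts before CA;
    pass transform=lambda x: x ** 0.25 for ROOTROOT-CA, or transform=None
    for plain CA. The CA steps themselves are the textbook SVD recipe.
    """
    N = transform(counts) if transform is not None else counts.astype(float)
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses (words)
    c = P.sum(axis=0)                    # column masses (contexts)
    # Standardized residuals: D_r^{-1/2} (P - r c^T) D_c^{-1/2}
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    # Row (word) principal coordinates: D_r^{-1/2} U diag(sigma)
    return (U[:, :dim] * sigma[:dim]) / np.sqrt(r)[:, None]

# Toy word-context counts (rows = words, columns = contexts; hypothetical)
counts = np.array([[10.0, 2.0, 0.0],
                   [8.0, 3.0, 1.0],
                   [0.0, 1.0, 9.0]])
emb = ca_embeddings(counts, dim=2)
print(emb.shape)  # (3, 2)
```

The square-root (or fourth-root) transform dampens the extreme co-occurrence counts that, as noted above, influence the success or failure of the decomposition.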
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Word Similarity | WordSim-353 | Spearman ρ 0.7 | 114 |
| Word Similarity | MEN | Spearman ρ 0.4 | 68 |
| Word Similarity | SimLex-999 | Spearman ρ 44.5 | 31 |
| Word Similarity | Mechanical Turk-771 | Spearman ρ 0.669 | 8 |
| Word Similarity | Aggregate of WordSim-353, MEN, Mechanical Turk-771, SimLex-999 | Sum of Spearman ρ 2.172 | 6 |
| Word Similarity | WordSim-353 (Wiki052024 corpus) | -- | 4 |
| Word Similarity | MEN (Wiki052024 corpus) | -- | 4 |
| Word Similarity | Mechanical Turk-771 (Wiki052024 corpus) | -- | 4 |
| Word Similarity | SimLex-999 (Wiki052024 corpus) | -- | 4 |
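The Spearman ρ figures reported in the table come from rank-correlating model cosine similarities with human similarity judgments over a benchmark's word pairs. A minimal sketch of that evaluation, with hypothetical embeddings and scores (not the paper's data), looks like this:

```python
import numpy as np
from scipy.stats import spearmanr

def eval_word_similarity(emb, pairs, human_scores):
    """Spearman rho between cosine similarities of embedding pairs and
    human judgments -- the metric behind the benchmark table.
    `emb` maps word -> vector; `pairs` is a list of (w1, w2) tuples.
    Names and data here are illustrative, not the paper's pipeline."""
    cos = []
    for w1, w2 in pairs:
        v1, v2 = emb[w1], emb[w2]
        cos.append(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
    rho, _ = spearmanr(cos, human_scores)
    return rho

# Tiny hypothetical benchmark: three pairs with human scores
emb = {"cat": np.array([1.0, 0.1]),
       "dog": np.array([0.9, 0.2]),
       "car": np.array([0.0, 1.0])}
pairs = [("cat", "dog"), ("cat", "car"), ("dog", "car")]
human = [9.0, 1.5, 2.0]
print(eval_word_similarity(emb, pairs, human))  # prints 1.0 (ranks agree perfectly)
```

The "Sum of Spearman ρ" aggregate row is simply the per-dataset ρ values added together, which is why its value (2.172) can exceed 1.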