
Zipfian Whitening

About

The word embedding space in neural models is skewed, and correcting this can improve task performance. We point out that most approaches for modeling, correcting, and measuring the symmetry of an embedding space implicitly assume that the word frequencies are uniform; in reality, word frequencies follow a highly non-uniform distribution, known as Zipf's law. Surprisingly, simply performing PCA whitening weighted by the empirical word frequency that follows Zipf's law significantly improves task performance, surpassing established baselines. From a theoretical perspective, both our approach and existing methods can be clearly categorized: word representations are distributed according to an exponential family with either uniform or Zipfian base measures. By adopting the latter approach, we can naturally emphasize informative low-frequency words in terms of their vector norm, which becomes evident from the information-geometric perspective, and in terms of the loss functions for imbalanced classification. Additionally, our theory corroborates that popular natural language processing methods, such as skip-gram negative sampling, WhiteningBERT, and headless language models, work well just because their word embeddings encode the empirical word frequency into the underlying probabilistic model.
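The core operation described above — PCA whitening where the mean and covariance are taken under the empirical word-frequency distribution rather than uniformly over the vocabulary — can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not the authors' implementation; the function name and the small eigenvalue regularizer are our own.

```python
import numpy as np

def zipfian_whitening(W, p, eps=1e-12):
    """Frequency-weighted (Zipfian) PCA whitening of word embeddings.

    W   : (V, d) embedding matrix, one row per word.
    p   : (V,) empirical word frequencies (e.g. unigram counts); normalized here.
    eps : small regularizer guarding against near-zero eigenvalues (our addition).

    Returns embeddings whose frequency-weighted mean is zero and whose
    frequency-weighted covariance is the identity.
    """
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    mu = p @ W                        # frequency-weighted mean, shape (d,)
    Wc = W - mu                       # center under p, not uniformly
    cov = (Wc * p[:, None]).T @ Wc    # frequency-weighted covariance, (d, d)
    vals, vecs = np.linalg.eigh(cov)  # symmetric eigendecomposition
    # rotate onto the eigenbasis and rescale each axis to unit variance
    return Wc @ vecs / np.sqrt(vals + eps)
```

Replacing `p` with the uniform vector `np.ones(V) / V` recovers standard PCA whitening, which makes the uniform-vs-Zipfian contrast drawn in the abstract concrete: the only change is the base measure used for the expectations.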

Sho Yokoi, Han Bao, Hiroto Kurita, Hidetoshi Shimodaira • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Semantic Textual Similarity | STS tasks (STS12, STS13, STS14, STS15, STS16, STS-B, SICK-R), various (test) | STS12 Score: 61.22 | 393 |
| Word Similarity | WS-353 (test) | Spearman Correlation: 0.8231 | 18 |
| Semantic Similarity | STS-B (test) | Semantic Consistency: 66.92 | 18 |
| Semantic Textual Similarity | JSTS (test) | JSTS Score: 65.56 | 7 |
| Lexical Similarity | MEN (test) | Spearman Correlation: 0.8435 | 5 |

Other info

Code
