
Length-Induced Embedding Collapse in PLM-based Models

About

Text embeddings from PLM-based models enable a wide range of applications, yet their performance often degrades on longer texts. In this paper, we introduce a phenomenon we call Length Collapse, where embeddings of longer texts tend to cluster together. This clustering results in a distributional inconsistency between the embeddings of short and long texts. We further investigate how these differences contribute to the performance decline observed with longer texts across various downstream tasks. Through a rigorous theoretical analysis of the self-attention mechanism, which acts as a low-pass filter in PLM-based models, we demonstrate that as text length increases, the strength of low-pass filtering intensifies, causing embeddings to retain more low-frequency components. As a result, input token features become more similar, leading to clustering and ultimately the collapse of embeddings for longer texts. To address this issue, we propose a simple method, TempScale, which mitigates the Length Collapse phenomenon. By narrowing the gap in low-pass filtering rates between long and short texts, TempScale ensures more consistent embeddings across different text lengths. This approach leads to performance improvements of 0.94% on MTEB and 1.10% on LongEmbed, which focuses specifically on long-context retrieval, providing strong evidence for the validity of our analysis. The source code is available at https://github.com/Yuqi-Zhou/Length_Collapse.
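The low-pass-filter intuition above can be sketched numerically. The toy model below is not the paper's implementation: it iterates a single-head self-attention layer (with identity query/key/value maps, a simplifying assumption) over random token features and measures how similar the tokens become. The `tau` parameter is a stand-in for temperature scaling in the spirit of TempScale: a smaller `tau` sharpens the softmax, weakening the averaging (low-pass) effect; the paper's actual scaling schedule may differ.

```python
import numpy as np

def attention_layer(X, tau=1.0):
    # Toy single-head self-attention with identity Q/K/V projections.
    # tau < 1 sharpens the softmax, i.e. weakens its low-pass (averaging) effect.
    d = X.shape[1]
    scores = X @ X.T / (np.sqrt(d) * tau)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)            # row-stochastic attention matrix
    return A @ X

def mean_pairwise_cosine(X):
    # Average cosine similarity over all distinct token pairs;
    # values near 1 indicate the token features have collapsed together.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    n = len(X)
    return (S.sum() - n) / (n * (n - 1))

def run(n_tokens, tau, layers=8, d=64, seed=0):
    X = np.random.default_rng(seed).standard_normal((n_tokens, d))
    for _ in range(layers):
        X = attention_layer(X, tau)
    return mean_pairwise_cosine(X)

short_sim   = run(16,  tau=1.0)  # short "text": tokens stay spread out
long_sim    = run(256, tau=1.0)  # long "text": tokens drift toward their mean
long_scaled = run(256, tau=0.5)  # sharpened softmax counteracts the drift
print(f"short={short_sim:.3f}  long={long_sim:.3f}  long+scaling={long_scaled:.3f}")
```

With more tokens, each softmax row spreads its mass over more of the sequence, so stacked layers average the features toward a common vector faster; sharpening the softmax for long inputs narrows that gap, mirroring the TempScale idea described in the abstract.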

Yuqi Zhou, Sunhao Dai, Zhanshuo Cao, Xiao Zhang, Jun Xu • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Reranking | MTEB Reranking (test) | - | 11 |
| Classification | MTEB English Classification subsets | Average Classification Score: 65.12 | 10 |
| Retrieval | MTEB English BEIR Retrieval | Average Score: 38.46 | 10 |
| Retrieval | LongEmbed Retrieval | Main Score: 56.88 | 10 |
| Summarization | MTEB English Summarization | Main Score: 32.17 | 10 |
| Clustering | MTEB English Clustering | Average Score: 45.79 | 10 |
| Semantic Textual Similarity | MTEB English STS subsets | Average STS Score: 75.68 | 10 |
