
Length-Induced Embedding Collapse in PLM-based Models

About

Text embeddings from PLM-based models enable a wide range of applications, yet their performance often degrades on longer texts. In this paper, we introduce a phenomenon we call Length Collapse, where embeddings of longer texts tend to cluster together. This clustering results in a distributional inconsistency between the embeddings of short and long texts. We further investigate how these differences contribute to the performance decline observed with longer texts across various downstream tasks. Through a rigorous theoretical analysis of the self-attention mechanism, which acts as a low-pass filter in PLM-based models, we demonstrate that as text length increases, the strength of low-pass filtering intensifies, causing embeddings to retain more low-frequency components. As a result, input token features become more similar, leading to clustering and ultimately the collapse of embeddings for longer texts. To address this issue, we propose a simple method, TempScale, which mitigates the Length Collapse phenomenon. By narrowing the gap in low-pass filtering rates between long and short texts, TempScale ensures more consistent embeddings across different text lengths. This approach leads to performance improvements of 0.94% on MTEB and 1.10% on LongEmbed, which focuses specifically on long-context retrieval, providing strong evidence for the validity of our analysis. The source code is available at https://github.com/Yuqi-Zhou/Length_Collapse.
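The low-pass-filter intuition above can be sketched numerically. The toy model below is not the paper's implementation: it iterates a single-head self-attention layer (with identity query/key/value maps, a simplifying assumption) over random token features and measures how similar the tokens become. The `tau` parameter is a stand-in for temperature scaling in the spirit of TempScale: a smaller `tau` sharpens the softmax, weakening the averaging (low-pass) effect; the paper's actual scaling schedule may differ.

```python
import numpy as np

def attention_layer(X, tau=1.0):
    # Toy single-head self-attention with identity Q/K/V projections.
    # tau < 1 sharpens the softmax, i.e. weakens its low-pass (averaging) effect.
    d = X.shape[1]
    scores = X @ X.T / (np.sqrt(d) * tau)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)            # row-stochastic attention matrix
    return A @ X

def mean_pairwise_cosine(X):
    # Average cosine similarity over all distinct token pairs;
    # values near 1 indicate the token features have collapsed together.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    n = len(X)
    return (S.sum() - n) / (n * (n - 1))

def run(n_tokens, tau, layers=8, d=64, seed=0):
    X = np.random.default_rng(seed).standard_normal((n_tokens, d))
    for _ in range(layers):
        X = attention_layer(X, tau)
    return mean_pairwise_cosine(X)

short_sim   = run(16,  tau=1.0)  # short "text": tokens stay spread out
long_sim    = run(256, tau=1.0)  # long "text": tokens drift toward their mean
long_scaled = run(256, tau=0.5)  # sharpened softmax counteracts the drift
print(f"short={short_sim:.3f}  long={long_sim:.3f}  long+scaling={long_scaled:.3f}")
```

With more tokens, each softmax row spreads its mass over more of the sequence, so stacked layers average the features toward a common vector faster; sharpening the softmax for long inputs narrows that gap, mirroring the TempScale idea described in the abstract.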

Yuqi Zhou, Sunhao Dai, Zhanshuo Cao, Xiao Zhang, Jun Xu • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Reranking | MTEB Reranking (test) | - | 11 |
| Classification | MTEB English Classification subsets | Average Classification Score: 65.12 | 10 |
| Retrieval | MTEB English BEIR Retrieval | Average Score: 38.46 | 10 |
| Retrieval | LongEmbed Retrieval | Main Score: 56.88 | 10 |
| Summarization | MTEB English Summarization | Main Score: 32.17 | 10 |
| Clustering | MTEB English Clustering | Average Score: 45.79 | 10 |
| Semantic Textual Similarity | MTEB English STS subsets | Average STS Score: 75.68 | 10 |
