Length-Induced Embedding Collapse in PLM-based Models
About
Text embeddings from PLM-based models enable a wide range of applications, yet their performance often degrades on longer texts. In this paper, we introduce a phenomenon we call Length Collapse, where embeddings of longer texts tend to cluster together. This clustering results in a distributional inconsistency between the embeddings of short and long texts. We further investigate how these differences contribute to the performance decline observed with longer texts across various downstream tasks. Through a rigorous theoretical analysis of the self-attention mechanism, which acts as a low-pass filter in PLM-based models, we demonstrate that as text length increases, the strength of low-pass filtering intensifies, causing embeddings to retain more low-frequency components. As a result, input token features become more similar, leading to clustering and ultimately the collapse of embeddings for longer texts. To address this issue, we propose a simple method, TempScale, which mitigates the Length Collapse phenomenon. By narrowing the gap in low-pass filtering rates between long and short texts, TempScale ensures more consistent embeddings across different text lengths. This approach leads to performance improvements of 0.94% on MTEB and 1.10% on LongEmbed, which focuses specifically on long-context retrieval, providing strong evidence for the validity of our analysis. The source code is available at https://github.com/Yuqi-Zhou/Length_Collapse.
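The low-pass filtering argument in the abstract can be illustrated with a toy experiment. The following is a minimal sketch, not the authors' implementation. It assumes a single attention head with identity query/key/value projections, random Gaussian token features, and a hypothetical `temperature` parameter that divides the attention logits before the softmax. The abstract does not say where TempScale applies its scaling, so that placement is an assumption. The helper names `self_attention` and `high_freq_ratio` are illustrative only. "High-frequency" here means whatever remains of the token features after subtracting their mean over the sequence.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(X, temperature=1.0):
    """Toy single-head self-attention with identity Q/K/V projections.

    `temperature` is a hypothetical knob (an assumption, not taken from the
    paper): values below 1 sharpen the attention rows, values above 1
    flatten them.
    """
    d = X.shape[-1]
    logits = (X @ X.T) / (np.sqrt(d) * temperature)
    A = softmax(logits, axis=-1)
    return A @ X, A


def high_freq_ratio(X_in, X_out):
    """Norm of the non-mean (high-frequency) part after the layer,
    divided by the same quantity before the layer."""
    hf_in = X_in - X_in.mean(axis=0, keepdims=True)
    hf_out = X_out - X_out.mean(axis=0, keepdims=True)
    return float(np.linalg.norm(hf_out) / np.linalg.norm(hf_in))


rng = np.random.default_rng(0)
d = 16
short_text = rng.standard_normal((8, d))    # stand-in for a short input
long_text = rng.standard_normal((512, d))   # stand-in for a long input

for name, X in [("short", short_text), ("long", long_text)]:
    for t in (1.0, 0.5):
        out, _ = self_attention(X, temperature=t)
        print(f"{name:5s} n={X.shape[0]:3d} temperature={t}: "
              f"high-frequency ratio = {high_freq_ratio(X, out):.3f}")
```

In this toy setup the longer input should keep a smaller share of its high-frequency signal than the shorter one, and lowering the temperature should raise that share for the long input, which narrows the gap between the two lengths. Real PLM-based embedding models use learned projections, multiple heads, and many layers, so the numbers printed here are only indicative.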
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Reranking | MTEB Reranking (test) | -- | 11 |
| Classification | MTEB English Classification subsets | Average Classification Score: 65.12 | 10 |
| Retrieval | MTEB English BEIR Retrieval | Average Score: 38.46 | 10 |
| Retrieval | LongEmbed Retrieval | Main Score: 56.88 | 10 |
| Summarization | MTEB English Summarization | Main Score: 32.17 | 10 |
| Clustering | MTEB English Clustering | Average Score: 45.79 | 10 |
| Semantic Textual Similarity | MTEB English STS subsets | Average STS Score: 75.68 | 10 |