
SMEC: Rethinking Matryoshka Representation Learning for Retrieval Embedding Compression

About

Large language models (LLMs) generate high-dimensional embeddings that capture rich semantic and syntactic information. However, high-dimensional embeddings exacerbate computational complexity and storage requirements, thereby hindering practical deployment. To address these challenges, we propose a novel training framework named Sequential Matryoshka Embedding Compression (SMEC). This framework introduces the Sequential Matryoshka Representation Learning (SMRL) method to mitigate gradient variance during training, the Adaptive Dimension Selection (ADS) module to reduce information degradation during dimension pruning, and the Selectable Cross-batch Memory (S-XBM) module to enhance unsupervised learning between high- and low-dimensional embeddings. Experiments on image, text, and multimodal datasets demonstrate that SMEC achieves significant dimensionality reduction while maintaining performance. For instance, on the BEIR dataset, our approach improves the performance of compressed LLM2Vec embeddings (256 dimensions) by 1.1 points and 2.7 points compared to the Matryoshka-Adaptor and Search-Adaptor models, respectively.
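
As a rough illustration of the setting the abstract describes, and not of SMEC itself, the sketch below shows the basic operation that Matryoshka-style compression relies on: keeping only a leading prefix of an embedding, re-normalizing it, and ranking documents by cosine similarity in the smaller space. The function names and the toy 4-dimensional vectors are hypothetical and chosen only for this example; SMRL, ADS, and S-XBM are not implemented here.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` coordinates and rescale them to unit L2 norm."""
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        # An all-zero prefix cannot be normalized; return it unchanged.
        return list(prefix)
    return [x / norm for x in prefix]


def cosine_rank(query, docs, dim):
    """Return document indices sorted by cosine similarity at `dim` dimensions."""
    q = truncate_and_normalize(query, dim)
    scored = []
    for index, doc in enumerate(docs):
        d = truncate_and_normalize(doc, dim)
        similarity = sum(a * b for a, b in zip(q, d))
        scored.append((similarity, index))
    # Highest similarity first; ties broken by the lower document index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored]


# Toy embeddings with made-up values (assumption: purely illustrative).
query = [1.0, 0.0, 0.5, 0.5]
docs = [
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.5, 0.5],
    [0.5, 0.5, 0.9, 0.1],
]

full_ranking = cosine_rank(query, docs, dim=4)        # all dimensions
compressed_ranking = cosine_rank(query, docs, dim=2)  # first half only
print(full_ranking, compressed_ranking)
```

Comparing the two rankings gives a quick sense of how much retrieval quality survives truncation. Plain prefix truncation is only a baseline; the abstract's ADS module suggests that SMEC chooses which dimensions to keep rather than always taking the leading ones.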

Biao Zhang, Lixin Chen, Tong Liu, Bo Zheng • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Information Retrieval | BEIR (test) | -- | 126 |
| Information Retrieval | BEIR | SciFact 0.9042 | 120 |
| Information Retrieval | FIQA BEIR (test) | nDCG@10 21.55 | 44 |
| Information Retrieval | Arguana BEIR | NDCG@10 56.41 | 33 |
| Information Retrieval | SciFact BEIR | NDCG@10 92.97 | 24 |
| Information Retrieval | Quora BEIR | nDCG@10 84.02 | 22 |
| Vision-Language Diagnostic Evaluation | COCO Flickr30K pool | Caption R@1 41.11 | 11 |
| Information Retrieval | NFCorpus Full BEIR | nDCG@10 17.15 | 11 |
| Information Retrieval | BEIR scidocs (test) | nDCG@10 0.0226 | 10 |
| Information Retrieval | Scidocs BEIR | -- | 6 |
Showing 10 of 14 rows
