Jasper and Stella: distillation of SOTA embedding models

About

A crucial component in many deep learning applications, such as Frequently Asked Questions (FAQ) and Retrieval-Augmented Generation (RAG), is dense retrieval. In this process, embedding models transform raw text into numerical vectors. However, the embedding models that currently excel on text embedding benchmarks, like the Massive Text Embedding Benchmark (MTEB), often have many parameters and high vector dimensionality, which makes them hard to deploy in real-world scenarios. To address this issue, we propose a novel multi-stage distillation framework that enables a smaller student embedding model to distill multiple larger teacher embedding models through three carefully designed losses. We also use Matryoshka Representation Learning (MRL) to reduce the vector dimensionality of the student embedding model effectively. Our 2-billion-parameter student model, Jasper, built upon the Stella embedding model, reached the No. 3 position on the MTEB leaderboard (as of December 24, 2024), with an average score of 71.54 across 56 datasets. We have released the model (https://huggingface.co/infgrad/jasper_en_vision_language_v1) and data (https://huggingface.co/datasets/infgrad/jasper_text_distill_dataset) on the Hugging Face Hub, and the training code is available in this project repository (https://github.com/NLPJCL/RAG-Retrieval).
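
The dimensionality reduction mentioned in the abstract rests on a general property of MRL-trained embeddings: a leading prefix of the vector can serve as a smaller embedding on its own. The following is a minimal sketch of that idea only, not the authors' training or inference code. The vectors are made-up toy values rather than Jasper outputs, and the helper names `truncate_and_normalize` and `cosine` are hypothetical.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale them to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


# Toy 8-dimensional vectors standing in for model outputs (assumed values).
query = [0.9, 0.1, 0.3, -0.2, 0.05, 0.0, -0.1, 0.2]
doc_a = [0.8, 0.2, 0.25, -0.1, -0.3, 0.4, 0.1, 0.0]
doc_b = [-0.5, 0.7, 0.1, 0.6, 0.2, -0.1, 0.3, 0.1]

# Score both documents against the query at several prefix lengths.
for dim in (8, 4, 2):
    q = truncate_and_normalize(query, dim)
    scores = {
        name: cosine(q, truncate_and_normalize(doc, dim))
        for name, doc in (("doc_a", doc_a), ("doc_b", doc_b))
    }
    print(dim, scores)
```

With these toy values the ranking of the two documents stays the same at every prefix length, which is the behaviour MRL training aims for. Real embeddings will generally lose some accuracy as the prefix shrinks.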

Dun Zhang, Jiacheng Li, Ziyang Zeng, Fulong Wang • 2024

Related benchmarks

Task | Dataset | Result | Rank
Multi-hop Question Answering | MuSiQue | EM 23.21 | 209
Long-context Question Answering | LongBench (test) | HotpotQA 35.45 | 69
Triplet Accuracy | Deliberation Evaluation Suite GSC, Remesh, Polis (test) | AbG 43.5 | 26
Faithfulness Evaluation | LongBench | NAR Score 72.46 | 18
Multi-hop Question Answering | 2WikiMultihopQA | EM 45.75 | 16
Medical Text Embedding | CMedTEB | MAP@10 (CMed v1) 87.16 | 13
Column matching | CIUS | Recall@10 85.5 | 10
Long-text Question Answering | UltraDomain | F1 (bio) 33.85 | 10
Column matching | CancerKG | Recall@10 79.8 | 10
Column matching | SAUS | Recall@10 85.2 | 10
Showing 10 of 26 rows
