LLM2Vec-Gen: Generative Embeddings from Large Language Models
About
Fine-tuning LLM-based text embedders via contrastive learning maps inputs and outputs into a new representational space, discarding the LLM's output semantics. We propose LLM2Vec-Gen, a self-supervised alternative that instead produces embeddings directly in the LLM's output space by learning to represent the model's potential response. Specifically, trainable special tokens are appended to the input and optimized to compress the LLM's own response into a fixed-length embedding, guided by an unsupervised embedding teacher and a reconstruction objective. Crucially, the LLM backbone remains frozen and training requires only unlabeled queries. LLM2Vec-Gen achieves state-of-the-art self-supervised performance on the Massive Text Embedding Benchmark (MTEB), improving by 8.8% over the unsupervised embedding teacher. Since the embeddings preserve the LLM's response-space semantics, they inherit capabilities such as safety alignment (up to 22.6% reduction in harmful content retrieval) and reasoning (up to 35.6% improvement on reasoning-intensive retrieval). Finally, the learned embeddings are also interpretable: they can be decoded back into text to reveal their semantic content.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Text Embedding Evaluation | MTEB eng v2 (test) | Retrieval Score42.2 | 18 | |
| Reasoning-intensive Retrieval | BRIGHT | BRIGHT Score19.3 | 8 | |
| Safety Retrieval | AdvBench-IR | AdvBench-IR Score44.4 | 8 |