LLM2Vec-Gen: Generative Embeddings from Large Language Models

About

Fine-tuning LLM-based text embedders via contrastive learning maps inputs and outputs into a new representational space, discarding the LLM's output semantics. We propose LLM2Vec-Gen, a self-supervised alternative that instead produces embeddings directly in the LLM's output space by learning to represent the model's potential response. Specifically, trainable special tokens are appended to the input and optimized to compress the LLM's own response into a fixed-length embedding, guided by an unsupervised embedding teacher and a reconstruction objective. Crucially, the LLM backbone remains frozen and training requires only unlabeled queries. LLM2Vec-Gen achieves state-of-the-art self-supervised performance on the Massive Text Embedding Benchmark (MTEB), improving by 8.8% over the unsupervised embedding teacher. Since the embeddings preserve the LLM's response-space semantics, they inherit capabilities such as safety alignment (up to 22.6% reduction in harmful content retrieval) and reasoning (up to 35.6% improvement on reasoning-intensive retrieval). Finally, the learned embeddings are also interpretable: they can be decoded back into text to reveal their semantic content.
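
As a rough illustration of the teacher-guided part of this setup, the sketch below pools the hidden states at the appended special-token positions into one fixed-length vector and scores it against a teacher embedding of the LLM's response. The abstract does not specify the pooling scheme or the loss form, so mean pooling, the cosine-distance loss, and all function names here (`mean_pool`, `cosine_similarity`, `alignment_loss`) are assumptions made for illustration, not the authors' implementation. The reconstruction objective is omitted.

```python
import math


def mean_pool(hidden_states):
    """Average a list of equal-length vectors into a single vector.

    Assumption: the special-token states are combined by mean pooling.
    """
    count = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(vec[i] for vec in hidden_states) / count for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def alignment_loss(special_token_states, teacher_response_embedding):
    """Hypothetical alignment loss: 1 - cos(student, teacher).

    special_token_states: hidden states of the frozen LLM at the trainable
        special-token positions appended to the query.
    teacher_response_embedding: the unsupervised teacher's embedding of the
        response the LLM generated for that query.
    """
    student = mean_pool(special_token_states)
    return 1.0 - cosine_similarity(student, teacher_response_embedding)


# Toy example with 2-dimensional vectors and two special tokens.
aligned_loss = alignment_loss([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0])
orthogonal_loss = alignment_loss([[1.0, 0.0]], [0.0, 1.0])
print(aligned_loss, orthogonal_loss)
```

In this toy setting the first loss is close to 0 (the pooled vector points in the teacher's direction) and the second is close to 1 (orthogonal vectors). In an actual training loop only the special-token parameters would receive gradients, since the backbone stays frozen.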

Parishad BehnamGhader, Vaibhav Adlakha, Fabian David Schmidt, Nicolas Chapados, Marius Mosbach, Siva Reddy • 2026

Related benchmarks

Task	Dataset	Metric	Result	
Text Embedding Evaluation	MTEB eng v2 (test)	Retrieval Score	42.2	18
Reasoning-intensive Retrieval	BRIGHT	BRIGHT Score	19.3	8
Safety Retrieval	AdvBench-IR	AdvBench-IR Score	44.4	8

