DiffRetriever: Parallel Representative Tokens for Retrieval with Diffusion Language Models

About

This paper shows how diffusion language models (DLMs) can be used as effective and efficient retrievers. Existing DLM-based retrievers (e.g., DiffEmbed) follow BERT-style encoding, representing each query or passage as a single mean-pooled vector. This ignores how DLMs are trained to generate responses through masked-position prediction under bidirectional attention, a capability that can provide stronger retrieval signals. We propose DiffRetriever, which uses the DLM's native masked-position prediction directly for retrieval. For each query or passage, DiffRetriever appends one or more masked positions, using the outputs as retrieval representations in a single forward pass. With one masked position, single-representation DiffRetriever already improves over DiffEmbed on the same backbones. DiffRetriever also naturally extends to multi-representation retrieval: DLMs process multiple masked positions jointly, enabling ColBERT-style fine-grained matching with little additional encoding latency. In autoregressive LLM retrievers, the same multi-representation strategy requires sequential decoding and therefore incurs much higher latency. DiffRetriever obtains the strongest aggregate effectiveness within our matched comparison, outperforming DiffEmbed, PromptReps, and RepLLaMA. Masked-position counts selected on training data transfer well across datasets, while per-query variation suggests headroom for adaptive allocation. Code is available at https://github.com/ielab/diffretriever.
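The two mechanical steps described in the abstract, appending masked positions to the input and matching the resulting vectors ColBERT-style, can be illustrated with a minimal sketch. It does not reproduce the released code: the helper names `append_masks` and `maxsim_score`, the toy vectors, and the use of an unnormalized dot product with sum-of-max aggregation are all assumptions made for illustration, and the DLM forward pass that would produce the vectors is omitted.

```python
from typing import List, Sequence

Vector = Sequence[float]


def append_masks(token_ids: List[int], num_masks: int, mask_id: int) -> List[int]:
    """Return the input followed by `num_masks` masked positions.

    Hypothetical helper: in a real pipeline the model's hidden states at
    these appended positions would serve as the retrieval representations.
    """
    return list(token_ids) + [mask_id] * num_masks


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs: Sequence[Vector], passage_vecs: Sequence[Vector]) -> float:
    """ColBERT-style late interaction (assumed scoring rule).

    For each query vector, take its best-matching passage vector,
    then sum those maxima over all query vectors.
    """
    return sum(max(dot(q, p) for p in passage_vecs) for q in query_vecs)


# Toy example: two masked-position vectors per query and per passage.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
passage_vecs = [[0.5, 0.5], [1.0, 0.0]]
score = maxsim_score(query_vecs, passage_vecs)
```

With a single masked position on each side, `maxsim_score` reduces to one dot product, which corresponds to the single-representation setting.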

Shuai Wang, Yu Yin, Shengyao Zhuang, Bevan Koopman, Guido Zuccon • 2026

Related benchmarks

Task	Dataset	Result
Retrieval	TREC DL 2019	--	83
Information Retrieval	COVID	--	50
Information Retrieval	TREC DL 2020	--	43
Information Retrieval	NQ	NDCG@10 (Dense) 64.4	21
Information Retrieval	FiQA	NDCG@10 (Dense) 0.479	21
Information Retrieval	Quora	NDCG@10 (Dense) 88.7	21
Information Retrieval	BEIR-7 Average	NDCG@10 (Dense) 67.1	21
Retrieval	MS Marco	D (MRR@10) 0.433	21
Information Retrieval	SciFact	NDCG@10 (Dense) 75.2	21
Information Retrieval	ArguAna	NDCG@10 (Dense) 41.4	21

Showing 10 of 11 rows
