OPERA: Online Data Pruning for Efficient Retrieval Model Adaptation

About

Domain-specific finetuning is essential for dense retrievers, yet not all training pairs contribute equally to the learning process. We introduce OPERA, a data pruning framework that exploits this heterogeneity to improve both the effectiveness and efficiency of retrieval model adaptation. We first investigate static pruning (SP), which retains only high-similarity query-document pairs, revealing an intrinsic quality-coverage tradeoff: ranking (NDCG) improves while retrieval (Recall) can degrade due to reduced query diversity. To resolve this tradeoff, we propose a two-stage dynamic pruning (DP) strategy that adaptively modulates sampling probabilities at both query and document levels throughout training, prioritizing high-quality examples while maintaining access to the full training set. Evaluations across eight datasets spanning six domains demonstrate the effectiveness of both approaches: SP improves ranking over standard finetuning (NDCG@10 +0.5%), while DP achieves the strongest performance on both ranking (NDCG@10 +1.9%) and retrieval (Recall@20 +0.7%), with an average rank of 1.38 across all methods. These findings scale to Qwen3-Embedding, an LLM-based dense retriever, confirming architecture-agnostic benefits. Notably, DP reaches comparable performance in less than 50% of the training time required by standard finetuning.

Haoyang Fang, Shuai Zhang, Yifei Ma, Hengyi Wang, Cuixiong Hu, Katrin Kirchhoff, Bernie Wang, George Karypis • 2026
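
The contrast between static and dynamic pruning described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the scoring function, the pruning threshold, or the schedule for updating probabilities, so the names `static_prune`, `sampling_probs`, `sample_batch`, the `keep_ratio` parameter, and the temperature-scaled softmax are all assumptions made for illustration. The two-stage (query-level and document-level) structure of DP is not modeled here; the sketch only shows how hard filtering differs from soft reweighting.

```python
import math
import random


def static_prune(pairs, scores, keep_ratio=0.5):
    """Static pruning (illustrative): keep only the highest-scoring pairs.

    `keep_ratio` is a hypothetical knob; pairs below the cut are never seen again.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, math.ceil(len(pairs) * keep_ratio))
    ranked = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)
    # Restore the original order among the kept pairs.
    return [pairs[i] for i in sorted(ranked[:n_keep])]


def sampling_probs(scores, temperature=1.0):
    """Dynamic pruning (illustrative): turn scores into sampling probabilities.

    A softmax is an assumed choice. It favors high-scoring pairs but leaves
    every pair with a nonzero chance of being drawn.
    """
    top = max(scores)
    weights = [math.exp((s - top) / temperature) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(pairs, scores, batch_size, temperature=1.0, rng=None):
    """Draw one training batch (with replacement) from the full training set."""
    rng = rng or random.Random(0)
    probs = sampling_probs(scores, temperature)
    return rng.choices(pairs, weights=probs, k=batch_size)


pairs = [("q1", "d1"), ("q2", "d2"), ("q3", "d3"), ("q4", "d4")]
scores = [0.9, 0.2, 0.7, 0.4]  # made-up similarity scores

kept = static_prune(pairs, scores, keep_ratio=0.5)
probs = sampling_probs(scores)
batch = sample_batch(pairs, scores, batch_size=8)
```

In a training loop, the scores passed to `sample_batch` would be recomputed periodically so that the probabilities track the model as it adapts; how and when that happens in OPERA is not stated in the abstract.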

Related benchmarks

Task	Dataset	Result
Dense Retrieval	NFCorpus	NDCG@10 49.1	5
Dense Retrieval	TripClick h	NDCG@10 30.9	5
Dense Retrieval	TripClick (t)	NDCG@10 24.9	5
Dense Retrieval	FiQA	NDCG@10 0.524	5
Dense Retrieval	ANTIQUE	NDCG@10 0.59	5
Dense Retrieval	TriviaQA	NDCG@10 50.1	5
Dense Retrieval	HotpotQA	NDCG@10 0.812	5
Dense Retrieval	FEVER	NDCG@10 0.915	5
Information Retrieval	ANTIQUE (test)	NDCG@10 54	5
