
Frustratingly Simple Retrieval Improves Challenging, Reasoning-Intensive Benchmarks

About

Retrieval-augmented generation (RAG) has primarily been studied in limited settings, such as factoid question answering; more challenging, reasoning-intensive benchmarks have seen limited success from minimal RAG. In this work, we challenge this prevailing view on established, reasoning-intensive benchmarks: MMLU, MMLU Pro, AGI Eval, GPQA, and MATH. We identify a key missing component in prior work: a usable, web-scale datastore aligned with the breadth of pretraining data. To this end, we introduce CompactDS: a diverse, high-quality, web-scale datastore that achieves high retrieval accuracy and subsecond latency on a single node. The key insights are (1) most web content can be filtered out without sacrificing coverage, and a compact, high-quality subset is sufficient; and (2) combining in-memory approximate nearest neighbor (ANN) retrieval and on-disk exact search balances speed and recall. Using CompactDS, we show that a minimal RAG pipeline achieves consistent accuracy improvements across all benchmarks and model sizes (8B--70B), with relative gains of 10% on MMLU, 33% on MMLU Pro, 14% on GPQA, and 19% on MATH. No single data source suffices alone, highlighting the importance of diversity of sources (web crawls, curated math, academic papers, textbooks). Finally, we show that our carefully designed in-house datastore matches or outperforms web search engines such as Google Search, as well as recently proposed, complex agent-based RAG systems--all while maintaining simplicity, reproducibility, and self-containment. We release CompactDS and our retrieval pipeline, supporting future research exploring retrieval-based AI systems.
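The second insight above (an in-memory ANN pass followed by on-disk exact rescoring) can be sketched as follows. This is a minimal illustration, not CompactDS's actual implementation: the int8-quantized matrix stands in for a real compressed ANN index, and a numpy memmap stands in for the disk-resident full-precision embeddings; all sizes and names are illustrative.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)
N, D = 10_000, 64  # toy corpus size and embedding dimension

# Full-precision, unit-normalized passage embeddings live "on disk"
# (here: a memmapped file of float32 vectors).
docs = rng.standard_normal((N, D)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
path = os.path.join(tempfile.mkdtemp(), "embeddings.f32")
docs.tofile(path)
on_disk = np.memmap(path, dtype=np.float32, mode="r", shape=(N, D))

# Stage-1 index held in RAM: heavily quantized (int8) copies of the
# vectors. Compact enough to keep in memory, but scores are approximate.
scale = float(np.abs(docs).max())
quantized = np.round(docs / scale * 127).astype(np.int8)


def search(query, k=10, shortlist=200):
    """Two-stage retrieval: approximate shortlist, then exact rerank."""
    q = query / np.linalg.norm(query)
    # Stage 1: approximate inner products over the in-memory index.
    approx = (quantized.astype(np.float32) @ q) * (scale / 127)
    cand = np.argpartition(-approx, shortlist)[:shortlist]
    # Stage 2: exact inner products, reading only the shortlisted
    # vectors from disk (sorted indices make the reads sequential-ish).
    cand = np.sort(cand)
    exact = on_disk[cand] @ q
    order = np.argsort(-exact)[:k]
    return cand[order], exact[order]


ids, scores = search(rng.standard_normal(D).astype(np.float32))
```

The trade-off this sketches: the quantized stage is fast but lossy, so it only has to place the true neighbors somewhere in the shortlist; the exact stage then recovers full-precision ranking while touching a tiny fraction of the on-disk data.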

Xinxi Lyu, Michael Duan, Rulin Shao, Pang Wei Koh, Sewon Min • 2025

Related benchmarks

Task          Dataset              Metric    Result  Rank
Math          MATH 500             Accuracy  89.9    86
Mathematics   AIME 2025            Accuracy  38.8    66
Mathematics   AIME 2024            Accuracy  60.4    60
Science       GPQA D               Accuracy  60.2    33
Code          LiveCodeBench V1-4   Accuracy  29.2    33
Code          LiveCodeBench V5-6   Accuracy  28.4    33
