
Frustratingly Simple Retrieval Improves Challenging, Reasoning-Intensive Benchmarks

About

Retrieval-augmented generation (RAG) has primarily been studied in limited settings, such as factoid question answering; more challenging, reasoning-intensive benchmarks have seen limited success from minimal RAG. In this work, we challenge this prevailing view on established, reasoning-intensive benchmarks: MMLU, MMLU Pro, AGI Eval, GPQA, and MATH. We identify a key missing component in prior work: a usable, web-scale datastore aligned with the breadth of pretraining data. To this end, we introduce CompactDS: a diverse, high-quality, web-scale datastore that achieves high retrieval accuracy and subsecond latency on a single node. The key insights are (1) most web content can be filtered out without sacrificing coverage, and a compact, high-quality subset is sufficient; and (2) combining in-memory approximate nearest neighbor (ANN) retrieval and on-disk exact search balances speed and recall. Using CompactDS, we show that a minimal RAG pipeline achieves consistent accuracy improvements across all benchmarks and model sizes (8B--70B), with relative gains of 10% on MMLU, 33% on MMLU Pro, 14% on GPQA, and 19% on MATH. No single data source suffices alone, highlighting the importance of diversity of sources (web crawls, curated math, academic papers, textbooks). Finally, we show that our carefully designed in-house datastore matches or outperforms web search engines such as Google Search, as well as recently proposed, complex agent-based RAG systems--all while maintaining simplicity, reproducibility, and self-containment. We release CompactDS and our retrieval pipeline, supporting future research exploring retrieval-based AI systems.
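The second insight above (an in-memory ANN pass followed by on-disk exact rescoring) can be sketched as follows. This is a minimal illustration, not CompactDS's actual implementation: the int8-quantized matrix stands in for a real compressed ANN index, and a numpy memmap stands in for the disk-resident full-precision embeddings; all sizes and names are illustrative.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)
N, D = 10_000, 64  # toy corpus size and embedding dimension

# Full-precision, unit-normalized passage embeddings live "on disk"
# (here: a memmapped file of float32 vectors).
docs = rng.standard_normal((N, D)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
path = os.path.join(tempfile.mkdtemp(), "embeddings.f32")
docs.tofile(path)
on_disk = np.memmap(path, dtype=np.float32, mode="r", shape=(N, D))

# Stage-1 index held in RAM: heavily quantized (int8) copies of the
# vectors. Compact enough to keep in memory, but scores are approximate.
scale = float(np.abs(docs).max())
quantized = np.round(docs / scale * 127).astype(np.int8)


def search(query, k=10, shortlist=200):
    """Two-stage retrieval: approximate shortlist, then exact rerank."""
    q = query / np.linalg.norm(query)
    # Stage 1: approximate inner products over the in-memory index.
    approx = (quantized.astype(np.float32) @ q) * (scale / 127)
    cand = np.argpartition(-approx, shortlist)[:shortlist]
    # Stage 2: exact inner products, reading only the shortlisted
    # vectors from disk (sorted indices make the reads sequential-ish).
    cand = np.sort(cand)
    exact = on_disk[cand] @ q
    order = np.argsort(-exact)[:k]
    return cand[order], exact[order]


ids, scores = search(rng.standard_normal(D).astype(np.float32))
```

The trade-off this sketches: the quantized stage is fast but lossy, so it only has to place the true neighbors somewhere in the shortlist; the exact stage then recovers full-precision ranking while touching a tiny fraction of the on-disk data.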

Xinxi Lyu, Michael Duan, Rulin Shao, Pang Wei Koh, Sewon Min • 2025

Related benchmarks

Task          Dataset              Metric    Result  Rank
Math          MATH 500             Accuracy  89.9    86
Mathematics   AIME 2025            Accuracy  38.8    66
Mathematics   AIME 2024            Accuracy  60.4    60
Science       GPQA D               Accuracy  60.2    33
Code          LiveCodeBench V1-4   Accuracy  29.2    33
Code          LiveCodeBench V5-6   Accuracy  28.4    33
