Stronger Baselines for Retrieval-Augmented Generation with Long-Context Language Models

About

With the rise of long-context language models (LMs) capable of processing tens of thousands of tokens in a single context window, do multi-stage retrieval-augmented generation (RAG) pipelines still offer measurable benefits over simpler, single-stage approaches? To assess this question, we conduct a controlled evaluation for QA tasks under systematically scaled token budgets, comparing two recent multi-stage pipelines, ReadAgent and RAPTOR, against three baselines, including DOS RAG (Document's Original Structure RAG), a simple retrieve-then-read method that preserves original passage order. Despite its straightforward design, DOS RAG consistently matches or outperforms more intricate methods on multiple long-context QA benchmarks. We trace this strength to a combination of maintaining source fidelity and document structure, prioritizing recall within effective context windows, and favoring simplicity over added pipeline complexity. We recommend establishing DOS RAG as a simple yet strong baseline for future RAG evaluations, paired with state-of-the-art embedding and language models, and benchmarked under matched token budgets, to ensure that added pipeline complexity is justified by clear performance gains as models continue to improve.
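The retrieve-then-read idea behind DOS RAG, as described above, can be illustrated with a minimal sketch: pick the highest-scoring chunks that fit a token budget, then present them in their original document order rather than in score order. This is not the authors' implementation. The `Chunk` type, the precomputed `score` field, the whitespace-based `count_tokens`, and the choice to skip chunks that do not fit are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    position: int   # index of the chunk in the original document
    text: str
    score: float    # hypothetical query-chunk similarity, computed elsewhere


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting stands in for a real tokenizer.
    return len(text.split())


def select_in_document_order(chunks: list[Chunk], token_budget: int) -> list[Chunk]:
    """Greedily keep the best-scoring chunks that fit the budget,
    then restore the original passage order."""
    selected: list[Chunk] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        cost = count_tokens(chunk.text)
        if used + cost > token_budget:
            continue  # assumption: skip chunks that do not fit, keep trying smaller ones
        selected.append(chunk)
        used += cost
    # Reorder by position so the reader model sees passages as they appear in the source.
    return sorted(selected, key=lambda c: c.position)


chunks = [
    Chunk(position=0, text="a b c", score=0.1),
    Chunk(position=1, text="d e", score=0.8),
    Chunk(position=2, text="f g h", score=0.5),
    Chunk(position=3, text="i j", score=0.9),
]
context = select_in_document_order(chunks, token_budget=5)
prompt_context = "\n".join(c.text for c in context)
```

In this toy example the chunk at position 3 scores highest, yet it appears after the chunk at position 1 in the assembled context, because selection is by score and presentation is by document position.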

Alex Laitenberger, Christopher D. Manning, Nelson F. Liu • 2025

Related benchmarks

Task	Dataset	Result
Question Answering	LongBench Qasper	F1 0.2635	62
Question Answering	2WikiMultihopQA LongBench	F1 Score 41.1	32
Question Answering	NarrativeQA LongBench	F1 Score 13.98	29

