Stronger Baselines for Retrieval-Augmented Generation with Long-Context Language Models
About
With the rise of long-context language models (LMs) capable of processing tens of thousands of tokens in a single context window, do multi-stage retrieval-augmented generation (RAG) pipelines still offer measurable benefits over simpler, single-stage approaches? To assess this question, we conduct a controlled evaluation for QA tasks under systematically scaled token budgets, comparing two recent multi-stage pipelines, ReadAgent and RAPTOR, against three baselines, including DOS RAG (Document's Original Structure RAG), a simple retrieve-then-read method that preserves original passage order. Despite its straightforward design, DOS RAG consistently matches or outperforms more intricate methods on multiple long-context QA benchmarks. We trace this strength to a combination of maintaining source fidelity and document structure, prioritizing recall within effective context windows, and favoring simplicity over added pipeline complexity. We recommend establishing DOS RAG as a simple yet strong baseline for future RAG evaluations, paired with state-of-the-art embedding and language models, and benchmarked under matched token budgets, to ensure that added pipeline complexity is justified by clear performance gains as models continue to improve.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Question Answering | LongBench Qasper | F10.2635 | 62 | |
| Question Answering | 2WikiMultihopQA LongBench | F1 Score41.1 | 32 | |
| Question Answering | NarrativeQA LongBench | F1 Score13.98 | 24 |