
Stronger Baselines for Retrieval-Augmented Generation with Long-Context Language Models

About

With the rise of long-context language models (LMs) capable of processing tens of thousands of tokens in a single context window, do multi-stage retrieval-augmented generation (RAG) pipelines still offer measurable benefits over simpler, single-stage approaches? To assess this question, we conduct a controlled evaluation for QA tasks under systematically scaled token budgets, comparing two recent multi-stage pipelines, ReadAgent and RAPTOR, against three baselines, including DOS RAG (Document's Original Structure RAG), a simple retrieve-then-read method that preserves original passage order. Despite its straightforward design, DOS RAG consistently matches or outperforms more intricate methods on multiple long-context QA benchmarks. We trace this strength to a combination of maintaining source fidelity and document structure, prioritizing recall within effective context windows, and favoring simplicity over added pipeline complexity. We recommend establishing DOS RAG as a simple yet strong baseline for future RAG evaluations, paired with state-of-the-art embedding and language models, and benchmarked under matched token budgets, to ensure that added pipeline complexity is justified by clear performance gains as models continue to improve.
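The retrieve-then-read idea described above (select the most relevant passages under a token budget, then present them in their original document order) can be sketched as follows. This is a minimal illustration and not the authors' implementation. The function names, the bag-of-words cosine scorer (a stand-in for a real embedding model), the whitespace token count, and the choice to skip chunks that do not fit the budget are all assumptions made for the example.

```python
import math
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Toy relevance score: cosine similarity of bag-of-words counts.

    Assumption: a real pipeline would use an embedding model here.
    """
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_in_document_order(chunks: list[str], query: str, token_budget: int) -> list[str]:
    """Pick the highest-scoring chunks that fit the budget, then restore source order."""
    # Rank chunk indices by relevance to the query, best first.
    ranked = sorted(range(len(chunks)),
                    key=lambda i: bow_cosine(chunks[i], query),
                    reverse=True)

    selected, used = [], 0
    for i in ranked:
        cost = len(chunks[i].split())  # assumption: whitespace words approximate tokens
        if used + cost > token_budget:
            continue  # assumption: skip chunks that do not fit, keep trying smaller ones
        selected.append(i)
        used += cost

    # The key step: re-sort the selected chunks by their position in the document.
    selected.sort()
    return [chunks[i] for i in selected]


chunks = [
    "intro text here",
    "cats like milk",
    "filler words only",
    "cats cats like milk",
]
context = retrieve_in_document_order(chunks, "cats milk", token_budget=7)
prompt = "\n".join(context) + "\n\nQuestion: cats milk"
```

In this toy run the last chunk scores highest, yet it still appears after the second chunk in the assembled context, because the selected passages are re-sorted by their original position before being handed to the reader model.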

Alex Laitenberger, Christopher D. Manning, Nelson F. Liu • 2025

Related benchmarks

Task                Dataset                    Result     Rank
Question Answering  LongBench Qasper           F1 0.2635  62
Question Answering  2WikiMultihopQA LongBench  F1 41.1    32
Question Answering  NarrativeQA LongBench      F1 13.98   24
