A-RAG: Scaling Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces

About

Frontier language models have demonstrated strong reasoning and long-horizon tool-use capabilities. However, existing RAG systems fail to leverage these capabilities. They still rely on two paradigms: (1) designing an algorithm that retrieves passages in a single shot and concatenates them into the model's input, or (2) predefining a workflow and prompting the model to execute it step by step. Neither paradigm allows the model to participate in retrieval decisions, which prevents these systems from scaling efficiently as models improve. In this paper, we introduce A-RAG, an Agentic RAG framework that exposes hierarchical retrieval interfaces directly to the model. A-RAG provides three retrieval tools: keyword search, semantic search, and chunk read, enabling the agent to adaptively search and retrieve information across multiple granularities. Experiments on multiple open-domain QA benchmarks show that A-RAG consistently outperforms existing approaches while retrieving a comparable or smaller number of tokens, demonstrating that A-RAG effectively leverages model capabilities and dynamically adapts to different RAG tasks. We further systematically study how A-RAG scales with model size and test-time compute. To facilitate future research, our code and evaluation suite are available at https://github.com/Ayanami0730/arag.
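To make the three-tool interface concrete, here is a minimal sketch of what keyword search, semantic search, and chunk read might look like over a toy in-memory corpus. It is an illustration under stated assumptions, not the authors' implementation: the class name `ToyCorpus`, the method signatures, the snippet length, and the return shapes are all hypothetical, and the semantic search uses bag-of-words cosine similarity as a stand-in for a real embedding model.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


@dataclass
class ToyCorpus:
    """Hypothetical in-memory corpus exposing three retrieval tools."""

    chunks: dict[str, str]  # chunk_id -> full chunk text

    def keyword_search(self, keyword: str, top_k: int = 3) -> list[tuple[str, str]]:
        """Coarse tool: return (chunk_id, short snippet) for chunks containing the keyword."""
        needle = keyword.lower()
        hits = []
        for chunk_id, text in self.chunks.items():
            count = text.lower().count(needle)
            if count:
                hits.append((count, chunk_id, text[:60]))
        # Most occurrences first; ties broken by chunk id for determinism.
        hits.sort(key=lambda h: (-h[0], h[1]))
        return [(chunk_id, snippet) for _, chunk_id, snippet in hits[:top_k]]

    def semantic_search(self, query: str, top_k: int = 3) -> list[tuple[str, float]]:
        """Stand-in for embedding search: bag-of-words cosine similarity (an assumption)."""
        q = Counter(tokenize(query))
        q_norm = math.sqrt(sum(v * v for v in q.values()))
        scored = []
        for chunk_id, text in self.chunks.items():
            c = Counter(tokenize(text))
            c_norm = math.sqrt(sum(v * v for v in c.values()))
            dot = sum(q[t] * c[t] for t in q)
            score = dot / (q_norm * c_norm) if q_norm and c_norm else 0.0
            if score > 0:
                scored.append((score, chunk_id))
        scored.sort(key=lambda s: (-s[0], s[1]))
        return [(chunk_id, round(score, 3)) for score, chunk_id in scored[:top_k]]

    def chunk_read(self, chunk_id: str) -> str:
        """Fine-grained tool: return the full text of one chunk (empty if unknown)."""
        return self.chunks.get(chunk_id, "")


# Toy data, invented purely for illustration.
corpus = ToyCorpus(
    chunks={
        "c1": "Marie Curie won the Nobel Prize in Physics in 1903.",
        "c2": "Pierre Curie was born in Paris.",
        "c3": "The Eiffel Tower is located in Paris.",
    }
)

# A scripted stand-in for the agent: search coarsely, then read the best chunk in full.
candidates = corpus.semantic_search("Where was Pierre Curie born?")
best_chunk_id = candidates[0][0]
print(corpus.chunk_read(best_chunk_id))
```

In an agentic setup, the two search calls would be issued by the model itself as tool calls, and the model would decide from the returned snippets or scores which chunks are worth reading in full. The hard-coded sequence at the end only stands in for that decision loop.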

Mingxuan Du, Benfeng Xu, Chiwei Zhu, Shaohan Wang, Pengyu Wang, Xiaorui Wang, Zhendong Mao • 2026

Related benchmarks

Task	Dataset	Metric	Result
Question Answering	MuSiQue	LLM Accuracy	74.1	34
Long-form Question Answering	GraphRAG-Bench Med	LLM Accuracy	93.1	20
Long-form Question Answering	Novel GraphRAG-Bench	LLM-Acc	85.3	20
Question Answering	HotpotQA	LLM Accuracy	94.5	20
Question Answering	2WikiMultihopQA	LLM-Acc	89.7	20
Legal Article Retrieval	LexRAG (test)	Recall@1	11.36	18
Legal Article Retrieval	STARD (test)	Recall@1	20.88	18
Legal Article Retrieval	StatuteRAG (test)	Recall@1	38.64	18
Question Answering	SpecsQA (test)	F1 (Factual Correctness)	5.7	13
Question Answering	SpecsQA	FC F1	5.7	13

Showing 10 of 15 rows
