PAR$^2$-RAG: Planned Active Retrieval and Reasoning for Multi-Hop Question Answering

About

Large language models (LLMs) remain brittle on multi-hop question answering (MHQA), where answering requires combining evidence across documents through retrieval and reasoning. Iterative retrieval systems can fail by locking onto an early low-recall trajectory and amplifying downstream errors, while planning-only approaches may produce static query sets that cannot adapt when intermediate evidence changes. We propose \textbf{Planned Active Retrieval and Reasoning RAG (PAR$^2$-RAG)}, a two-stage framework that separates \emph{coverage} from \emph{commitment}. PAR$^2$-RAG first performs breadth-first anchoring to build a high-recall evidence frontier, then applies depth-first refinement with evidence-sufficiency control in an iterative loop. Across four MHQA benchmarks, PAR$^2$-RAG consistently outperforms existing state-of-the-art baselines: compared with IRCoT, it achieves up to \textbf{23.5\%} higher accuracy, with retrieval gains of up to \textbf{10.5\%} in NDCG.
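The two stages described above can be illustrated with a minimal sketch. Everything here is an illustrative assumption, not the paper's implementation: the toy word-overlap retriever, the sub-query expansion, and the keyword-based sufficiency check all stand in for the learned components of PAR$^2$-RAG.

```python
def retrieve(query, corpus, k=2):
    """Toy lexical retriever: rank documents by word overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(corpus, key=lambda d: -len(q_words & set(d.lower().split())))
    return scored[:k]

def breadth_first_anchoring(question, corpus):
    """Stage 1 (coverage): issue several planned sub-queries up front to
    build a high-recall evidence frontier before committing to any path."""
    sub_queries = [question] + [w for w in question.split() if len(w) > 4]
    frontier = []
    for q in sub_queries:
        for doc in retrieve(q, corpus):
            if doc not in frontier:
                frontier.append(doc)
    return frontier

def evidence_sufficient(question, evidence):
    """Toy sufficiency check: every long question word appears in the evidence."""
    text = " ".join(evidence).lower()
    return all(w.lower() in text for w in question.split() if len(w) > 4)

def depth_first_refinement(question, frontier, corpus, max_iters=3):
    """Stage 2 (commitment): iteratively issue follow-up queries conditioned
    on the current evidence until the sufficiency check passes."""
    evidence = list(frontier)
    for _ in range(max_iters):
        if evidence_sufficient(question, evidence):
            break
        follow_up = question + " " + evidence[-1]  # condition on latest evidence
        for doc in retrieve(follow_up, corpus):
            if doc not in evidence:
                evidence.append(doc)
    return evidence
```

The key design point the abstract argues for is the separation of concerns: stage 1 only widens recall, and only stage 2 commits to a reasoning trajectory, so an early low-recall query cannot lock in a bad path.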

Xingyu Li, Rongguang Wang, Yuying Wang, Mengqing Guo, Chenyang Li, Tao Sheng, Sujith Ravi, Dan Roth • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Multi-hop Question Answering | 2Wiki | — | 152 |
| Multi-hop Question Answering | MoreHopQA | Accuracy 86.4 | 25 |
| Multi-hop Question Answering | FRAMES | Accuracy 85.8 | 22 |
| Multi-hop QA Retrieval | MuSiQue | NDCG 63.6 | 5 |
| Multi-hop QA Retrieval | MoreHopQA | NDCG 0.908 | 5 |
| Multi-hop QA Retrieval | FRAMES | NDCG 0.834 | 5 |
| Multi-hop QA Retrieval | Average (MuSiQue, 2Wiki, MoreHopQA, FRAMES) | NDCG 0.791 | 5 |
