FrugalRAG: Less is More in RL Finetuning for Multi-Hop Question Answering
About
Reinforcement learning (RL) based on the final answer's reward has driven recent progress in small language models (SLMs) on reasoning-heavy tasks such as math and code. However, applying the same techniques to retrieval-augmented generation (RAG) benchmarks like multi-hop QA has yielded limited gains, often trailing supervised or prompting-only baselines. We argue instead that a viable path for RL in multi-hop QA is to use test-time scaling judiciously, optimizing both final answer accuracy and the efficiency of reaching that answer. We propose FrugalRAG, a two-stage finetuning framework that adaptively reduces the number of retrieval steps based on a question's difficulty. First, we train an SLM with supervised finetuning on a full-exploration policy that generates broad sub-queries. Then, we apply RL to adaptively prune search depth based on question difficulty, directly rewarding policies that balance correctness with frugality. Unlike prior approaches requiring 10x more data, our method achieves competitive performance with only approximately 1,000 examples. On HotpotQA and other multi-hop QA benchmarks, FrugalRAG attains state-of-the-art efficiency-accuracy tradeoffs, cutting retrieval cost nearly in half. Moreover, on the challenging BrowseCompPlus benchmark, it generalizes zero-shot and surpasses SLM-based and other baselines. These results show that using RL to reduce reasoning steps, rather than to increase them, is an effective route to scalable and efficient RAG.
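The abstract does not spell out the reward used in the RL stage, so the following is only a minimal sketch of what "balancing correctness with frugality" could look like. The `Rollout` type, the `frugal_reward` function, the exact-match correctness check, the linear cost, and the `max_retrievals` and `penalty` values are all hypothetical choices made for illustration, not the paper's formulation.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """One hypothetical episode: the predicted answer, the gold answer,
    and the number of retrieval calls the policy made."""
    answer: str
    gold: str
    num_retrievals: int


def frugal_reward(rollout: Rollout, max_retrievals: int = 6, penalty: float = 0.5) -> float:
    """Illustrative reward (an assumption, not the paper's definition).

    A wrong answer earns 0. A correct answer earns 1 minus a cost that
    grows linearly with the fraction of the retrieval budget consumed.
    """
    # Case-insensitive exact match stands in for a real answer metric.
    correct = rollout.answer.strip().lower() == rollout.gold.strip().lower()
    if not correct:
        return 0.0
    # Clamp so that exceeding the budget cannot push the cost past `penalty`.
    used = min(rollout.num_retrievals, max_retrievals) / max_retrievals
    return 1.0 - penalty * used


# Two correct rollouts and one incorrect rollout for the same question.
short = Rollout(answer="Paris", gold="paris", num_retrievals=2)
long = Rollout(answer="Paris", gold="paris", num_retrievals=6)
wrong = Rollout(answer="Lyon", gold="paris", num_retrievals=1)
```

Under these assumptions a correct answer reached in fewer retrieval steps always scores higher than a correct answer that uses the full budget, while an incorrect answer earns nothing however few steps it took, so the policy cannot gain reward by simply stopping early.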
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Multi-hop Question Answering | 2WikiMultihopQA | -- | -- | 387 |
| Multi-hop Question Answering | HotpotQA fullwiki setting (dev) | -- | -- | 38 |
| Web-based Question Answering | BrowseComp+ | Accuracy | 21.53 | 22 |
| Multi-hop Question Answering | 2Wiki | MBE | 51.2 | 17 |
| Multi-hop Question Answering | HotpotQA | MBE | 58 | 17 |
| Multi-hop Retrieval | HotpotQA | Recall | 70.4 | 6 |
| Multi-hop Question Answering | MuSiQue | Recall | 52.6 | 6 |
| Multi-hop Question Answering | 2WikiMultiHopQA full-wiki (dev) | MBE | 47.6 | 6 |
| Multi-hop Question Answering | MuSiQue full-wiki (dev) | MBE | 30.1 | 6 |