Search and Refine During Think: Facilitating Knowledge Refinement for Improved Retrieval-Augmented Reasoning
About
Large language models have demonstrated impressive reasoning capabilities but are inherently limited by their knowledge reservoir. Retrieval-augmented reasoning mitigates this limitation by allowing LLMs to query external resources, but existing methods often retrieve irrelevant or noisy information, hindering accurate reasoning. In this paper, we propose AutoRefine, a reinforcement learning post-training framework that adopts a new "search-and-refine-during-think" paradigm. AutoRefine introduces explicit knowledge refinement steps between successive search calls, enabling the model to iteratively filter, distill, and organize evidence before generating an answer. Furthermore, we incorporate tailored retrieval-specific rewards alongside answer correctness rewards using group relative policy optimization. Experiments on single-hop and multi-hop QA benchmarks demonstrate that AutoRefine significantly outperforms existing approaches, particularly in complex, multi-hop reasoning scenarios. Detailed analysis shows that AutoRefine issues frequent, higher-quality searches and synthesizes evidence effectively.
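The paper pairs answer-correctness rewards with retrieval-specific rewards and optimizes them with group relative policy optimization. As a rough illustration only, the sketch below shows one plausible shape for such a combined reward and the group-relative advantage normalization used by GRPO; the tag names (`<refine>`, `<answer>`), the containment check, and the reward weights are assumptions for illustration, not the paper's actual specification.

```python
import re

def extract_tag(text, tag):
    """Return the contents of the last <tag>...</tag> block, or None."""
    matches = re.findall(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return matches[-1].strip() if matches else None

def autorefine_reward(rollout, gold_answer):
    """Hypothetical combined reward: 1.0 for an exact-match answer,
    plus a retrieval-specific bonus (0.5, illustrative weight) when the
    refined evidence already contains the gold answer."""
    answer = extract_tag(rollout, "answer")
    refined = extract_tag(rollout, "refine")
    answer_reward = 1.0 if answer and answer.lower() == gold_answer.lower() else 0.0
    refine_reward = 0.5 if refined and gold_answer.lower() in refined.lower() else 0.0
    return answer_reward + refine_reward

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize each rollout's reward by the
    mean and standard deviation of its sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]
```

Rewarding the refinement step directly, rather than only the final answer, is what lets training signal reach the intermediate evidence-distillation behavior between search calls.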
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-hop Question Answering | 2WikiMultihopQA | EM | 32.8 | 278 |
| Multi-hop Question Answering | MuSiQue | EM | 16.9 | 106 |
| Multi-hop Question Answering | Bamboogle | Exact Match | 32 | 97 |
| Single-hop Question Answering | TriviaQA | EM | 58.7 | 62 |
| Multi-hop Question Answering | HotpotQA | Exact Match (EM) | 38.2 | 56 |
| Single-hop Question Answering | PopQA | EM | 44.9 | 55 |
| Multi-hop Question Answering | BrowseComp-ZH | LJFT | 5.19 | 5 |
| Multi-hop Question Answering | Web Dancer | LJFT | 39.09 | 5 |
| Multi-hop Question Answering | MuSiQue | LJFT | 19.8 | 5 |
| Multi-hop Question Answering | Average (BrowseComp-ZH, Bamboogle, MuSiQue, Web Dancer) (Overall) | LJFT Score | 28.02 | 5 |