Can David Beat Goliath? On Multi-Hop Reasoning with Resource-Constrained Agents

About

Multi-turn reasoning agents solve complex questions by decomposing them into intermediate retrieval or tool-use steps, for accumulating supporting evidence across turns. Meanwhile, with reinforcement learning (RL), training these agents rely on many on-policy rollouts and large training batches. Under realistic resource constraints that make dense exploration infeasible, each RL batch contains only few useful reasoning paths from the current policy. Existing approaches do not fully address this bottleneck: SFT-based initialization can overfit when annotated trajectories are scarce, retrieval-level rewards can assign credit to individual retrieved documents without directly optimizing coverage of the full evidence set, and expansion can waste rollouts from poorly chosen prefixes. We introduce David-GRPO, which improves small-batch learning by using information from both outside and inside the current policy: (i) expert bootstrapping injects a few off-policy expert trajectories into RL updates, and (ii) evidence-guided exploration turns on-policy partial successes into evidence-coverage scores and additional continuations. On agents up to 1.5B parameters trained on four RTX 3090 GPUs, David-GRPO improves over prior RL baselines under the same low-budget setting on six multi-hop QA benchmarks. The gains come with a behavioral shift: unlike prior low-budget RL baselines that often skip retrieval or stop after shallow search, David-GRPO learns to increase retrieval depth and evidence coverage.

Hojae Han, Heeyun Jung, Jongyoon Kim, Seung-won Hwang• 2026

Related benchmarks

Task	Dataset	Result
Multi-hop Question Answering	HotpotQA (test)	F133.8	311
Multi-hop Question Answering	2WikiMultiHopQA (test)	EM27.2	226
Multi-hop Question Answering	MuSiQue (test)	F112.6	128
Multi-hop Question Answering	Bamboogle (test)	EM22	98
Multi-hop Question Answering	Antileak-m (test)	EM36.3	12

Showing 5 of 5 rows

Other info

Follow for update

@wizwand_team Discord