Truncated Step-Level Sampling with Process Rewards for Retrieval-Augmented Reasoning

About

Reinforcement learning has emerged as an effective paradigm for training large language models to interleave reasoning with search engine calls. However, existing approaches face a fundamental credit assignment problem: methods like Search-R1 assign a single outcome reward to the entire multi-step trajectory, providing no signal about which reasoning or retrieval decisions were responsible for success or failure. Process-reward methods such as StepSearch introduce step-level supervision but still sample complete trajectories independently, so advantage estimates at any given step are contaminated by the randomness of all other steps. We propose SLATE (Step-Level Advantage estimation for Truncated Exploration), which addresses both problems through two complementary ideas. First, truncated step-level sampling generates k continuations from a shared prefix, isolating all variation to a single decision point. We prove this reduces the variance of advantage estimates by up to a factor of T compared to full-trajectory sampling for T-step trajectories, the first formal variance guarantee for step-level RL in retrieval-augmented reasoning. Second, dense, decomposed process rewards separately evaluate reasoning quality, query quality, and answer correctness on a ternary scale via an LLM judge, providing richer supervision than binary outcome signals or heuristic step-level scores. Experiments on seven QA benchmarks show that SLATE consistently outperforms both sparse-reward and process-reward baselines, achieving a 7.0% relative improvement over Search-R1 on the 7B model and 30.7% on the 3B model. Gains are largest on challenging multi-hop tasks, and ablations confirm that truncated sampling and dense rewards provide complementary benefits.
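The variance claim can be illustrated with a toy simulation. The sketch below is not the authors' implementation or their proof. It assumes each of the T steps adds independent, unit-variance Gaussian noise to the reward, and that advantages are computed by subtracting the group mean. All function names are hypothetical. Under these assumptions a shared prefix cancels out of the advantage, so only one step's noise remains, and the variance ratio against full-trajectory sampling comes out close to T.

```python
import random
import statistics


def group_advantages(rewards):
    """Subtract the group mean from each reward (a simple group-relative baseline)."""
    baseline = statistics.fmean(rewards)
    return [r - baseline for r in rewards]


def full_trajectory_group(k, T, rng):
    """k independent trajectories; all T steps are resampled in each one."""
    return [sum(rng.gauss(0.0, 1.0) for _ in range(T)) for _ in range(k)]


def truncated_group(k, T, rng):
    """k one-step continuations from a single shared prefix of T - 1 steps."""
    # Toy assumption: the prefix contributes the same amount to every sample,
    # so it cancels when the group mean is subtracted.
    prefix = sum(rng.gauss(0.0, 1.0) for _ in range(T - 1))
    return [prefix + rng.gauss(0.0, 1.0) for _ in range(k)]


def advantage_variance(sample_group, k, T, trials, seed):
    """Empirical variance of the first sample's advantage over many groups."""
    rng = random.Random(seed)
    firsts = [group_advantages(sample_group(k, T, rng))[0] for _ in range(trials)]
    return statistics.pvariance(firsts)


def variance_ratio(k=4, T=5, trials=4000, seed=0):
    """Ratio of full-trajectory to truncated advantage variance in the toy model."""
    full = advantage_variance(full_trajectory_group, k, T, trials, seed)
    trunc = advantage_variance(truncated_group, k, T, trials, seed + 1)
    return full / trunc
```

Calling `variance_ratio(T=5)` should return a value near 5 in this idealized setting. Real trajectories have correlated steps and unequal per-step variances, which is consistent with the abstract stating the reduction as "up to" a factor of T.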

Chris Samarinas, Haw-Shiuan Chang, Hamed Zamani • 2026

Related benchmarks

Task	Dataset	Result
Question Answering	2Wiki	EM 36.8	241
Question Answering	Bamboogle	EM 49.4	227
Question Answering	TriviaQA	EM 65.2	182
Question Answering	PopQA	Exact Match 47	133
Question Answering	2WikiMultihopQA	EM 41.3	107
Question Answering	NQ	Exact Match 49.7	101
Question Answering	MuSiQue	EM 24.7	62

