Truncated Step-Level Sampling with Process Rewards for Retrieval-Augmented Reasoning

About

Reinforcement learning has emerged as an effective paradigm for training large language models to interleave reasoning with search engine calls. However, existing approaches face a fundamental credit assignment problem: methods like Search-R1 assign a single outcome reward to the entire multi-step trajectory, providing no signal about which reasoning or retrieval decisions were responsible for success or failure. Process-reward methods such as StepSearch introduce step-level supervision but still sample complete trajectories independently, so advantage estimates at any given step are contaminated by the randomness of all other steps. We propose SLATE (Step-Level Advantage estimation for Truncated Exploration), which addresses both problems through two complementary ideas. First, truncated step-level sampling generates k continuations from a shared prefix, isolating all variation to a single decision point. We prove this reduces the variance of advantage estimates by up to a factor of T compared to full-trajectory sampling for T-step trajectories, the first formal variance guarantee for step-level RL in retrieval-augmented reasoning. Second, dense, decomposed process rewards separately evaluate reasoning quality, query quality, and answer correctness on a ternary scale via an LLM judge, providing richer supervision than binary outcome signals or heuristic step-level scores. Experiments on seven QA benchmarks show that SLATE consistently outperforms both sparse-reward and process-reward baselines, achieving a 7.0% relative improvement over Search-R1 on the 7B model and 30.7% on the 3B model. Gains are largest on challenging multi-hop tasks, and ablations confirm that truncated sampling and dense rewards provide complementary benefits.
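The variance claim above can be illustrated with a toy simulation. This is a minimal sketch and not the authors' implementation. It assumes that a trajectory's return is the sum of T independent per-step rewards drawn uniformly from a ternary scale {-1, 0, 1}, and that advantages are computed relative to the group mean over k samples. All function names and the reward model are hypothetical choices made for illustration.

```python
import random
import statistics

# Assumed ternary reward scale; the source only says rewards are ternary.
TERNARY = (-1.0, 0.0, 1.0)


def step_reward(rng):
    """Toy stand-in for a judged step reward: uniform over the ternary scale."""
    return rng.choice(TERNARY)


def full_trajectory_returns(rng, T, k):
    """Sample k complete trajectories independently; all T steps vary."""
    return [sum(step_reward(rng) for _ in range(T)) for _ in range(k)]


def truncated_step_returns(rng, prefix_rewards, k):
    """Sample k continuations of a single step from one shared prefix.

    The prefix contributes the same constant to every sample, so only
    the one sampled step differs across the group.
    """
    prefix_total = sum(prefix_rewards)
    return [prefix_total + step_reward(rng) for _ in range(k)]


def group_advantages(returns):
    """Group-relative advantage: each return minus the group mean."""
    mean = statistics.fmean(returns)
    return [r - mean for r in returns]


def empirical_variance_ratio(T=8, k=4, trials=5000, seed=0):
    """Ratio of advantage variance: full-trajectory over truncated sampling."""
    rng = random.Random(seed)
    full, truncated = [], []
    for _ in range(trials):
        full.extend(group_advantages(full_trajectory_returns(rng, T, k)))
        prefix = [step_reward(rng) for _ in range(T - 1)]
        truncated.extend(group_advantages(truncated_step_returns(rng, prefix, k)))
    return statistics.pvariance(full) / statistics.pvariance(truncated)


if __name__ == "__main__":
    print(f"variance ratio for T=8: {empirical_variance_ratio():.2f}")
```

Under these toy assumptions the ratio comes out close to T, because the full-trajectory return accumulates noise from all T steps while the truncated group differs in only one step. Real trajectories have correlated steps and non-additive rewards, which is consistent with the abstract stating the reduction as "up to" a factor of T.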

Chris Samarinas, Haw-Shiuan Chang, Hamed Zamani • 2026

Related benchmarks

Task                Dataset           Result            Rank
Question Answering  2Wiki             EM 36.8           241
Question Answering  Bamboogle         EM 49.4           227
Question Answering  TriviaQA          EM 65.2           182
Question Answering  PopQA             Exact Match 47    133
Question Answering  2WikiMultihopQA   EM 41.3           107
Question Answering  NQ                Exact Match 49.7  101
Question Answering  MuSiQue           EM 24.7           62
