
Truncated Step-Level Sampling with Process Rewards for Retrieval-Augmented Reasoning

About

Reinforcement learning has emerged as an effective paradigm for training large language models to interleave reasoning with search engine calls. However, existing approaches face a fundamental credit assignment problem: methods like Search-R1 assign a single outcome reward to the entire multi-step trajectory, providing no signal about which reasoning or retrieval decisions were responsible for success or failure. Process-reward methods such as StepSearch introduce step-level supervision but still sample complete trajectories independently, so advantage estimates at any given step are contaminated by the randomness of all other steps. We propose SLATE (Step-Level Advantage estimation for Truncated Exploration), which addresses both problems through two complementary ideas. First, truncated step-level sampling generates k continuations from a shared prefix, isolating all variation to a single decision point. We prove this reduces the variance of advantage estimates by up to a factor of T compared to full-trajectory sampling for T-step trajectories, the first formal variance guarantee for step-level RL in retrieval-augmented reasoning. Second, dense, decomposed process rewards separately evaluate reasoning quality, query quality, and answer correctness on a ternary scale via an LLM judge, providing richer supervision than binary outcome signals or heuristic step-level scores. Experiments on seven QA benchmarks show that SLATE consistently outperforms both sparse-reward and process-reward baselines, achieving a 7.0% relative improvement over Search-R1 on the 7B model and 30.7% on the 3B model. Gains are largest on challenging multi-hop tasks, and ablations confirm that truncated sampling and dense rewards provide complementary benefits.
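The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions the abstract does not confirm: the ternary scale is encoded as -1, 0, 1; the three judge scores are combined with equal weights; and the advantage is each continuation's reward minus the mean reward of the k continuations sharing a prefix. The names `step_reward`, `step_advantages`, `truncated_step_sampling`, `sample_step`, and `judge_step` are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Assumed encoding of the ternary scale; the abstract only says "ternary".
TERNARY_SCORES = (-1, 0, 1)


def step_reward(reasoning: int, query: int, answer: Optional[int] = None) -> float:
    """Combine decomposed judge scores into one step reward.

    Equal weighting is an assumption. `answer` is omitted for steps
    that do not emit a final answer.
    """
    scores = [reasoning, query] if answer is None else [reasoning, query, answer]
    for s in scores:
        if s not in TERNARY_SCORES:
            raise ValueError(f"score {s} is not on the ternary scale")
    return sum(scores) / len(scores)


def step_advantages(rewards: List[float]) -> List[float]:
    """Advantage of each continuation relative to the group mean (assumed baseline)."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def truncated_step_sampling(
    prefix: str,
    sample_step: Callable[[str], str],
    judge_step: Callable[[str, str], Dict[str, int]],
    k: int,
) -> Tuple[List[str], List[float]]:
    """Sample k next steps from one shared prefix and score them.

    Every candidate shares the same history, so differences in reward
    come only from the single decision being sampled.
    """
    candidates = [sample_step(prefix) for _ in range(k)]
    rewards = [step_reward(**judge_step(prefix, c)) for c in candidates]
    return candidates, step_advantages(rewards)
```

Here `sample_step` stands in for the policy model producing one reasoning-or-search step, and `judge_step` stands in for the LLM judge, returning a dict with `reasoning`, `query`, and optionally `answer` keys. Because the baseline is the group mean, the advantages for one prefix sum to zero.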

Chris Samarinas, Haw-Shiuan Chang, Hamed Zamani • 2026

Related benchmarks

Task               | Dataset         | Result           | Rank
Question Answering | TriviaQA        | EM 65.2          | 182
Question Answering | 2Wiki           | --               | 152
Question Answering | Bamboogle       | EM 49.4          | 120
Question Answering | 2WikiMultihopQA | EM 41.3          | 107
Question Answering | MuSiQue         | EM 24.7          | 50
Question Answering | NQ              | Exact Match 49.7 | 46
Question Answering | PopQA           | Exact Match 47   | 25
