
ProRAG: Process-Supervised Reinforcement Learning for Retrieval-Augmented Generation

About

Reinforcement learning (RL) has become a promising paradigm for optimizing Retrieval-Augmented Generation (RAG) in complex reasoning tasks. However, traditional outcome-based RL approaches often suffer from reward sparsity and inefficient credit assignment, as coarse-grained scalar rewards fail to identify specific erroneous steps within long-horizon trajectories. This ambiguity frequently leads to "process hallucinations", where models reach correct answers through flawed logic or redundant retrieval steps. Although recent process-aware approaches attempt to mitigate this via static preference learning or heuristic reward shaping, they often lack the on-policy exploration capabilities required to decouple step-level credit from global outcomes. To address these challenges, we propose ProRAG, a process-supervised reinforcement learning framework designed to integrate learned step-level supervision into the online optimization loop. Our framework consists of four stages: (1) Supervised Policy Warmup to initialize the model with a structured reasoning format; (2) construction of an MCTS-based Process Reward Model (PRM) to quantify intermediate reasoning quality; (3) PRM-Guided Reasoning Refinement to align the policy with fine-grained process preferences; and (4) Process-Supervised Reinforcement Learning with a dual-granularity advantage mechanism. By aggregating step-level process rewards with global outcome signals, ProRAG provides precise feedback for every action. Extensive experiments on five multi-hop reasoning benchmarks demonstrate that ProRAG achieves superior overall performance compared to strong outcome-based and process-aware RL baselines, particularly on complex long-horizon tasks, validating the effectiveness of fine-grained process supervision. The code and model are available at https://github.com/lilinwz/ProRAG.
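The dual-granularity advantage in stage (4) can be illustrated with a small sketch. The abstract does not give the exact formula, so the following is an assumption-laden illustration: per-step PRM scores are centered within each trajectory for relative step-level credit, the global outcome reward is normalized across the sampled group (GRPO-style), and the two signals are mixed with a hypothetical weight `alpha`. The function name and all parameters are illustrative, not the paper's API.

```python
import numpy as np

def dual_granularity_advantages(step_rewards, outcome_rewards, alpha=0.5):
    """Sketch of a dual-granularity advantage (illustrative, not ProRAG's exact rule).

    step_rewards:    list of 1-D arrays, one per sampled trajectory, holding the
                     PRM score assigned to each reasoning/retrieval step.
    outcome_rewards: 1-D array with one scalar outcome reward per trajectory.
    alpha:           assumed mixing weight between process and outcome signals.
    """
    outcome = np.asarray(outcome_rewards, dtype=float)
    # Group-normalize the trajectory-level outcome reward (GRPO-style).
    outcome_adv = (outcome - outcome.mean()) / (outcome.std() + 1e-8)

    advantages = []
    for steps, out_adv in zip(step_rewards, outcome_adv):
        steps = np.asarray(steps, dtype=float)
        # Center step rewards within the trajectory: steps scored above the
        # trajectory mean get positive process credit, weaker steps negative.
        step_adv = steps - steps.mean()
        # Broadcast the global outcome advantage to every step and blend,
        # so each action receives both local and global feedback.
        advantages.append(alpha * step_adv + (1.0 - alpha) * out_adv)
    return advantages
```

With two sampled trajectories where the first succeeds and the second fails, every step of the successful trajectory receives a positive outcome component, while the centered PRM scores still separate strong from weak steps inside it.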

Zhao Wang, Ziliang Zhao, Zhicheng Dou • 2026

Related benchmarks

Task | Dataset | Result (EM) | Rank
Question Answering | MuSiQue | 23.5 | 84
Question Answering | PopQA | 47.2 | 80
Question Answering | HotpotQA | 41.4 | 79
Question Answering | 2WikiMultihopQA | 46.0 | 73
Question Answering | Bamboogle | 45.6 | 62
Question Answering | Average (PopQA, HotpotQA, 2Wiki, MuSiQue, Bamboogle) | 40.7 | 10
