
ProRAG: Process-Supervised Reinforcement Learning for Retrieval-Augmented Generation

About

Reinforcement learning (RL) has become a promising paradigm for optimizing Retrieval-Augmented Generation (RAG) in complex reasoning tasks. However, traditional outcome-based RL approaches often suffer from reward sparsity and inefficient credit assignment, as coarse-grained scalar rewards fail to identify specific erroneous steps within long-horizon trajectories. This ambiguity frequently leads to "process hallucinations", where models reach correct answers through flawed logic or redundant retrieval steps. Although recent process-aware approaches attempt to mitigate this via static preference learning or heuristic reward shaping, they often lack the on-policy exploration capabilities required to decouple step-level credit from global outcomes. To address these challenges, we propose ProRAG, a process-supervised reinforcement learning framework designed to integrate learned step-level supervision into the online optimization loop. Our framework consists of four stages: (1) Supervised Policy Warmup to initialize the model with a structured reasoning format; (2) construction of an MCTS-based Process Reward Model (PRM) to quantify intermediate reasoning quality; (3) PRM-Guided Reasoning Refinement to align the policy with fine-grained process preferences; and (4) Process-Supervised Reinforcement Learning with a dual-granularity advantage mechanism. By aggregating step-level process rewards with global outcome signals, ProRAG provides precise feedback for every action. Extensive experiments on five multi-hop reasoning benchmarks demonstrate that ProRAG achieves superior overall performance compared to strong outcome-based and process-aware RL baselines, particularly on complex long-horizon tasks, validating the effectiveness of fine-grained process supervision. The code and model are available at https://github.com/lilinwz/ProRAG.
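The dual-granularity advantage in stage (4) can be illustrated with a small sketch. The abstract does not give the exact formula, so the following is an assumption-laden illustration: per-step PRM scores are centered within each trajectory for relative step-level credit, the global outcome reward is normalized across the sampled group (GRPO-style), and the two signals are mixed with a hypothetical weight `alpha`. The function name and all parameters are illustrative, not the paper's API.

```python
import numpy as np

def dual_granularity_advantages(step_rewards, outcome_rewards, alpha=0.5):
    """Sketch of a dual-granularity advantage (illustrative, not ProRAG's exact rule).

    step_rewards:    list of 1-D arrays, one per sampled trajectory, holding the
                     PRM score assigned to each reasoning/retrieval step.
    outcome_rewards: 1-D array with one scalar outcome reward per trajectory.
    alpha:           assumed mixing weight between process and outcome signals.
    """
    outcome = np.asarray(outcome_rewards, dtype=float)
    # Group-normalize the trajectory-level outcome reward (GRPO-style).
    outcome_adv = (outcome - outcome.mean()) / (outcome.std() + 1e-8)

    advantages = []
    for steps, out_adv in zip(step_rewards, outcome_adv):
        steps = np.asarray(steps, dtype=float)
        # Center step rewards within the trajectory: steps scored above the
        # trajectory mean get positive process credit, weaker steps negative.
        step_adv = steps - steps.mean()
        # Broadcast the global outcome advantage to every step and blend,
        # so each action receives both local and global feedback.
        advantages.append(alpha * step_adv + (1.0 - alpha) * out_adv)
    return advantages
```

With two sampled trajectories where the first succeeds and the second fails, every step of the successful trajectory receives a positive outcome component, while the centered PRM scores still separate strong from weak steps inside it.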

Zhao Wang, Ziliang Zhao, Zhicheng Dou • 2026

Related benchmarks

Task | Dataset | Result (EM) | Rank
Question Answering | MuSiQue | 23.5 | 84
Question Answering | PopQA | 47.2 | 80
Question Answering | HotpotQA | 41.4 | 79
Question Answering | 2WikiMultihopQA | 46.0 | 73
Question Answering | Bamboogle | 45.6 | 62
Question Answering | Average (PopQA, HotpotQA, 2Wiki, MuSiQue, Bamboogle) | 40.7 | 10
