
Evaluate-as-Action: Self-Evaluated Process Rewards for Retrieval-Augmented Agents

About

Retrieval-augmented agents can query external evidence, yet their reliability in multi-step reasoning remains limited: noisy retrieval may derail multi-hop question answering, while outcome-only reinforcement learning provides credit signals that are too coarse to optimize intermediate steps. We propose EvalAct (Evaluate-as-Action), which converts implicit retrieval quality assessment into an explicit action and enforces a coupled Search-to-Evaluate protocol so that each retrieval is immediately followed by a structured evaluation score, yielding process signals aligned with the interaction trajectory. To leverage these signals, we introduce Process-Calibrated Advantage Rescaling (PCAR), a GRPO-based optimization method that rescales advantages at the segment level according to evaluation scores, emphasizing reliable segments while updating uncertain ones conservatively. Experiments on seven open-domain QA benchmarks show that EvalAct achieves the best average accuracy, with the largest gains on multi-hop tasks, and ablations verify that the explicit evaluation loop drives the primary improvements while PCAR provides consistent additional benefits.
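The coupled Search-to-Evaluate protocol can be sketched as a simple agent loop in which every retrieval is immediately followed by an explicit evaluation action. This is an illustrative sketch only: the function names (`search`, `evaluate`, `generate`), the prompt kinds, and the stopping rule are hypothetical stand-ins, not the paper's actual interface.

```python
# Illustrative sketch of a coupled Search-to-Evaluate agent loop.
# Assumptions (not from the paper): the agent is driven by three callables,
# and the evaluation action returns a structured score in [0, 1].

def run_agent(question, search, evaluate, generate, max_steps=4):
    """Each retrieval is immediately followed by a structured evaluation
    score, producing one process signal per interaction segment."""
    segments = []
    context = question
    for _ in range(max_steps):
        query = generate("query", context)   # agent proposes a search query
        docs = search(query)                 # retrieve external evidence
        score = evaluate(query, docs)        # explicit evaluate action, in [0, 1]
        segments.append({"query": query, "docs": docs, "score": score})
        context += f"\n[docs] {docs} [eval] {score:.2f}"
        if generate("done?", context) == "yes":
            break
    answer = generate("answer", context)
    return answer, segments
```

The per-segment scores collected here are exactly the process signals that a segment-level training objective can later consume.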
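To make the PCAR idea concrete, here is a minimal sketch of segment-level advantage rescaling, assuming GRPO-style group-normalized advantages and per-segment evaluation scores in [0, 1]. The linear rescaling rule and the `floor` parameter are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of process-calibrated advantage rescaling (PCAR-style).
# Assumptions (not from the paper): GRPO group-relative advantages, and each
# trajectory split into segments carrying a self-evaluation score in [0, 1].

def grpo_advantages(rewards):
    """Group-relative advantage: reward minus group mean, over group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

def rescale_segment_advantage(advantage, eval_score, floor=0.5):
    """Scale a segment's advantage by its evaluation score: reliable
    segments (score near 1) keep the full signal, while uncertain
    segments (low score) receive a conservatively shrunk update."""
    weight = floor + (1.0 - floor) * eval_score  # in [floor, 1]
    return advantage * weight

# Toy usage: one group of 4 trajectories with 0/1 outcome rewards.
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
strong = rescale_segment_advantage(advs[0], eval_score=0.9)
weak = rescale_segment_advantage(advs[0], eval_score=0.2)
```

The intent is that `strong > weak`: a high-confidence segment from a winning trajectory drives a larger policy update than a segment the agent itself scored as uncertain.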

Jiangming Shu, Yuxiang Zhang, Ye Ma, Xueyuan Lin, Jitao Sang • 2026

Related benchmarks

Task                           Dataset    Metric  Result  Rank
Multi-hop Question Answering   2Wiki      EM      52.1    152
Multi-hop Question Answering   Bamboogle  EM      56      128
Multi-hop Question Answering   HotpotQA   EM      48.8    117
Single-hop Question Answering  PopQA      EM      43.6    104
Single-hop Question Answering  TriviaQA   EM      65.6    81
Multi-hop Question Answering   MuSiQue    EM      25.3    58
Single-hop Question Answering  NQ         EM      38.5    44
