
FaithRL: Learning to Reason Faithfully through Step-Level Faithfulness Maximization

About

Reinforcement Learning with Verifiable Rewards (RLVR) has markedly improved the performance of Large Language Models (LLMs) on tasks requiring multi-step reasoning. However, most RLVR pipelines rely on sparse outcome-based rewards, providing little supervision over intermediate steps and thus encouraging over-confidence and spurious reasoning, which in turn increases hallucinations. To address this, we propose FaithRL, a general reinforcement learning framework that directly optimizes reasoning faithfulness. We formalize a faithfulness-maximization objective and theoretically show that optimizing it mitigates over-confidence. To instantiate this objective, we introduce a geometric reward design and a faithfulness-aware advantage modulation mechanism that assigns step-level credit by penalizing unsupported steps while preserving valid partial derivations. Across diverse backbones and benchmarks, FaithRL consistently reduces hallucination rates while maintaining (and often improving) answer correctness. Further analysis confirms that FaithRL increases step-wise reasoning faithfulness and generalizes robustly. Our code is available at https://github.com/aintdoin/FaithRL.
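The abstract describes step-level credit assignment only at a high level: unsupported steps are penalized while valid partial derivations are preserved. As a rough illustration of that idea, and not the formula used in FaithRL, the sketch below spreads a single outcome-level advantage over reasoning steps. The function name `modulate_advantages`, the `penalty` parameter, and the assumption that each step already carries a boolean "supported" label are all hypothetical.

```python
from typing import List


def modulate_advantages(
    outcome_advantage: float,
    step_supported: List[bool],
    penalty: float = 1.0,
) -> List[float]:
    """Hypothetical step-level credit assignment (not the paper's formula).

    outcome_advantage: sequence-level advantage from the outcome reward.
    step_supported:    assumed per-step labels, True if the step is supported.
    penalty:           assumed scale for the penalty on unsupported steps.
    """
    step_advantages = []
    for supported in step_supported:
        if supported:
            # Supported steps keep positive credit and are shielded from a
            # negative outcome, so valid partial derivations are preserved.
            step_advantages.append(max(outcome_advantage, 0.0))
        else:
            # Unsupported steps are pushed down regardless of the outcome.
            step_advantages.append(-penalty * abs(outcome_advantage))
    return step_advantages


# Correct final answer with one unsupported middle step.
print(modulate_advantages(0.5, [True, False, True]))
# Wrong final answer: the supported step is not punished.
print(modulate_advantages(-0.5, [True, False]))
```

Under these assumptions, a correct answer reached through an unsupported step still sends a negative signal to that step, which is the behavior a sparse outcome-only reward cannot express.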

Runquan Gui, Yafu Li, Xiaoye Qu, Ziyan Liu, Yeqiu Cheng, Yu Cheng • 2026

Related benchmarks

Task                          Dataset               Result                   Rank
Multi-hop Question Answering  2WikiMultiHopQA Full  Accuracy (C): 87.5       22
Multi-hop Question Answering  HotpotQA Full         C (Correctness): 86.1    22
Multi-hop Question Answering  MuSiQue Full          C Score: 80              22
Logical reasoning             LogiQA                Accuracy (LogiQA): 68.2  12
Graduate-Level Reasoning      GPQA D                Accuracy: 44.4           12
Reasoning                     Overall               Overall Accuracy: 69.8   12
Reasoning                     AMC23                 Accuracy: 90             12
Multi-hop Reasoning           2WikiMultihopQA       Accuracy: 76.6           12
