Reasoning Models Hallucinate More: Factuality-Aware Reinforcement Learning for Large Reasoning Models

About

Large language models (LLMs) have advanced significantly on reasoning tasks through reinforcement learning (RL) optimization, achieving impressive capabilities across various challenging benchmarks. However, our empirical analysis reveals a critical drawback: reasoning-oriented RL fine-tuning significantly increases the prevalence of hallucinations. We theoretically analyze the RL training dynamics, identifying high-variance gradients, entropy-induced randomness, and susceptibility to spurious local optima as key factors leading to hallucinations. To address this drawback, we propose Factuality-aware Step-wise Policy Optimization (FSPO), an RL fine-tuning algorithm that incorporates explicit factuality verification at each reasoning step. FSPO uses automated verification against given evidence to dynamically adjust token-level advantage values, incentivizing factual correctness throughout the reasoning process. Experiments across mathematical reasoning and hallucination benchmarks using Qwen2.5 and Llama models demonstrate that FSPO reduces hallucinations while improving reasoning accuracy.
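
To make the idea of adjusting token-level advantages by step-wise factuality more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's actual formula: the function name, the label encoding (+1 supported by the evidence, 0 unverifiable, -1 contradicted), and the sign-based adjustment rule are all hypothetical choices made for this example.

```python
from typing import List


def adjust_token_advantages(
    seq_advantage: float,
    step_labels: List[int],
    step_token_counts: List[int],
) -> List[float]:
    """Spread one sequence-level advantage over tokens, step by step.

    ASSUMPTION (not from the paper): each reasoning step carries a
    factuality label from an external verifier:
      +1 = supported by the evidence, 0 = unverifiable, -1 = contradicted.
    ASSUMPTION: the adjustment rule is a simple sign override.
    """
    if len(step_labels) != len(step_token_counts):
        raise ValueError("need one label and one token count per step")

    token_advantages: List[float] = []
    for label, n_tokens in zip(step_labels, step_token_counts):
        if label > 0:
            # Factual step: always reinforce its tokens.
            adv = abs(seq_advantage)
        elif label < 0:
            # Contradicted step: always penalize its tokens.
            adv = -abs(seq_advantage)
        else:
            # Unverifiable step: keep the original advantage.
            adv = seq_advantage
        token_advantages.extend([adv] * n_tokens)
    return token_advantages


# A response with a wrong final answer (negative advantage) whose first
# step is factual, second is unverifiable, and third is contradicted.
print(adjust_token_advantages(-0.5, [1, 0, -1], [2, 1, 2]))
```

Under this hypothetical rule, tokens in a verified step are rewarded even when the final answer is wrong, and tokens in a contradicted step are penalized even when the final answer is right, which is one way a step-wise factuality signal could counteract rewards based only on the outcome.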

Junyi Li, Hwee Tou Ng • 2025

Related benchmarks

Task	Dataset	Result
Factuality	TruthfulQA	Accuracy 10.26	145
Factual Knowledge Evaluation	PopQA	Accuracy 23.54	56
Factual Question Answering	TriviaQA	Accuracy 63.43	46
Factual QA	NQ-Open	Accuracy 39.25	36
Factual QA	SimpleQA	Accuracy 5.12	24
Multi-hop Question Answering	HotpotQA Full	C (Correctness) 78.3	22
Multi-hop Question Answering	2WikiMultiHopQA Full	Accuracy (C) 77.7	22
Multi-hop Question Answering	MuSiQue Full	C Score 63.2	22
Question Answering	HotpotQA	Faithfulness 91.83	21
Question Answering	TriviaQA	Faith 85.93	21

Showing 10 of 16 rows
