CARE-RFT: Confidence-Anchored Reinforcement Finetuning for Reliable Reasoning in Large Language Models
About
Reinforcement finetuning (RFT) has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models. However, we identify a critical trade-off: while unconstrained RFT achieves strong reasoning performance, it severely compromises model trustworthiness by amplifying hallucination and worsening calibration; conversely, reverse-KL (RKL)-constrained RFT preserves trustworthiness but limits reasoning gains, because its penalty on exploratory deviations is unbounded. To resolve this tension, we introduce CARE-RFT (Confidence-Anchored Regularized Reinforcement Finetuning), which replaces the standard reverse KL regularizer with a skew reverse KL divergence. The resulting penalty is confidence-sensitive: it is bounded for confident, consistently rewarded explorations, which enables reasoning, and remains unbounded elsewhere, which preserves calibration. Extensive experiments across multiple model scales and RFT algorithms show that CARE-RFT achieves a superior balance, matching the reasoning performance of unconstrained RFT while recovering the trustworthiness and calibration of the base model. Our work establishes that careful, confidence-aware regularization is key to building reasoning models that are both capable and trustworthy.
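The abstract does not pin down the exact form of the skew reverse KL. One common skew construction mixes the policy back into the reference distribution, which is what makes the divergence bounded; the sketch below illustrates only that construction. The function name, the fixed scalar `alpha`, and the token-level interface are assumptions for illustration, not the authors' implementation, and CARE-RFT's confidence-sensitive modulation of the penalty is not modeled here.

```python
import math

import torch
import torch.nn.functional as F


def skew_reverse_kl(logp_theta: torch.Tensor,
                    logp_ref: torch.Tensor,
                    alpha: float = 0.9) -> torch.Tensor:
    """Per-position skew reverse KL over the vocabulary dimension.

    Hypothetical sketch: KL(pi_theta || (1 - alpha) * pi_theta + alpha * pi_ref).
    Because the second argument mixes pi_theta back in, the density ratio
    pi_theta / mixture never exceeds 1 / (1 - alpha), so the divergence is
    bounded by log(1 / (1 - alpha)) for alpha < 1; the ordinary (unbounded)
    reverse KL is recovered in the alpha -> 1 limit.
    """
    # log((1 - alpha) * pi_theta + alpha * pi_ref), computed stably in log space
    log_mix = torch.logaddexp(math.log(1.0 - alpha) + logp_theta,
                              math.log(alpha) + logp_ref)
    # KL(p || m) = E_p[log p - log m], summed over the vocab axis
    return (logp_theta.exp() * (logp_theta - log_mix)).sum(dim=-1)


# Toy usage: per-token penalties for two positions over a 5-token vocabulary
penalty = skew_reverse_kl(
    F.log_softmax(torch.randn(2, 5), dim=-1),  # policy log-probs
    F.log_softmax(torch.randn(2, 5), dim=-1),  # frozen reference log-probs
)
```

In CARE-RFT, the effective skew would presumably be tied to the model's confidence so that only confident, consistently rewarded deviations fall in the bounded regime; with a fixed `alpha`, the sketch shows only why the mixture form is bounded at all.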
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH | Accuracy | 77.6 | 535 |
| Truthfulness Evaluation | TruthfulQA | Accuracy | 55.7 | 93 |
| Factuality Evaluation | TruthfulQA | -- | -- | 40 |
| Model Calibration | MATH, GSM8K, SelfAware, and TruthfulQA combined | ECE | 0.086 | 10 |
| Factuality | SelfAware | Score | 0.355 | 10 |
| Self-awareness | SelfAware | Accuracy | 50.2 | 10 |
| Calibration | Calibration Evaluation Set | ECE | 0.132 | 10 |
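Several rows report expected calibration error (ECE). For reference, the following is a minimal sketch of the standard equal-width-binned ECE; the bin count and interface are illustrative, and the paper's exact binning protocol is not specified here.

```python
import numpy as np


def expected_calibration_error(confidences: np.ndarray,
                               correct: np.ndarray,
                               n_bins: int = 10) -> float:
    """Equal-width-binned ECE: bin-weighted mean of |accuracy - confidence|."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)


# Toy usage: perfectly calibrated predictions should give ECE near 0
rng = np.random.default_rng(0)
conf = rng.uniform(size=10_000)
ece = expected_calibration_error(conf, rng.uniform(size=10_000) < conf)
```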