
CARE-RFT: Confidence-Anchored Reinforcement Finetuning for Reliable Reasoning in Large Language Models

About

Reinforcement finetuning (RFT) has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models. However, we identify a critical trade-off: while unconstrained RFT achieves strong reasoning performance, it severely compromises model trustworthiness by amplifying hallucination and worsening calibration; conversely, reverse-KL (RKL)-constrained RFT preserves trustworthiness but limits reasoning gains due to its unbounded penalty on exploratory deviations. To resolve this tension, we introduce CARE-RFT (Confidence-Anchored Regularized Reinforcement Finetuning), a novel method that replaces standard reverse KL regularization with a skew reverse KL divergence. CARE-RFT provides a confidence-sensitive penalty: it is bounded for confident, consistently rewarded explorations to enable reasoning, while remaining unbounded elsewhere to preserve calibration. Extensive experiments across multiple model scales and RFT algorithms show that CARE-RFT achieves a superior balance, matching the reasoning performance of unconstrained RFT while recovering the trustworthiness and calibration of the base model. Our work establishes that careful, confidence-aware regularization is key to building both capable and trustworthy reasoning models.
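The abstract specifies the regularizer only at a high level. Below is a minimal sketch of one common parameterization of a skewed reverse KL, D_KL(pi_theta || alpha * pi_theta + (1 - alpha) * pi_ref), which is bounded by -log(alpha) per position. The function name, the fixed skew coefficient `alpha`, and the full-vocabulary computation are illustrative assumptions; CARE-RFT's confidence-dependent anchoring of this term is not reproduced here.

```python
# Sketch of a skewed reverse KL penalty over per-token distributions.
# Assumes full next-token distributions for the policy and a frozen
# reference model; `alpha` is a fixed placeholder for what CARE-RFT
# anchors to model confidence.
import torch


def skew_reverse_kl(policy_logprobs: torch.Tensor,
                    ref_logprobs: torch.Tensor,
                    alpha: float = 0.1) -> torch.Tensor:
    """KL(pi || alpha * pi + (1 - alpha) * ref), summed over the vocab axis.

    Mixing alpha * pi into the second argument caps the log-ratio at
    -log(alpha), so each position's penalty is bounded; plain reverse KL
    (alpha = 0) grows without bound as the reference probability -> 0.
    """
    pi = policy_logprobs.exp()
    mix = alpha * pi + (1.0 - alpha) * ref_logprobs.exp()
    return (pi * (policy_logprobs - mix.log())).sum(dim=-1)


# Illustrative usage: per-position penalties, each <= -log(alpha).
policy_lp = torch.log_softmax(torch.randn(2, 4, 32), dim=-1)
ref_lp = torch.log_softmax(torch.randn(2, 4, 32), dim=-1)
print(skew_reverse_kl(policy_lp, ref_lp))  # tensor of shape (2, 4)
```

The bound follows because the mixture satisfies mix >= alpha * pi elementwise, so no single token can incur an arbitrarily large penalty, whereas standard reverse KL penalizes exploratory tokens without limit as they drift from the reference.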

Shuozhe Li, Jincheng Cao, Bodun Hu, Aryan Mokhtari, Leqi Liu, Amy Zhang • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Mathematical Reasoning | MATH | Accuracy: 77.6 | 535 |
| Truthfulness Evaluation | TruthfulQA | Accuracy: 55.7 | 93 |
| Factuality Evaluation | TruthfulQA | – | 40 |
| Model Calibration | MATH, GSM8K, SelfAware, and TruthfulQA combined | ECE: 0.086 | 10 |
| Factuality | SelfAware | Score: 0.355 | 10 |
| Self-awareness | SelfAware | Accuracy: 50.2 | 10 |
| Calibration | Calibration Evaluation Set | ECE: 0.132 | 10 |
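Several rows report ECE (expected calibration error), which partitions predictions into confidence bins and takes the sample-weighted average of the gap between mean confidence and accuracy in each bin. A minimal sketch, assuming the standard equal-width binning; the function name and the choice of 10 bins are illustrative, not necessarily the paper's evaluation protocol.

```python
# Equal-width-bin expected calibration error (ECE); 10 bins is a common
# default, not necessarily the setting used for the numbers above.
import numpy as np


def expected_calibration_error(confidences: np.ndarray,
                               correct: np.ndarray,
                               n_bins: int = 10) -> float:
    """Sample-weighted mean |accuracy - confidence| across confidence bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return float(ece)


# Illustrative usage with toy predictions; lower is better-calibrated.
conf = np.array([0.9, 0.9, 0.6, 0.6, 0.6])
hits = np.array([1.0, 1.0, 1.0, 0.0, 1.0])
print(expected_calibration_error(conf, hits))  # ~0.08
```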
