Annotation-Efficient Universal Honesty Alignment

About

Honesty alignment, the ability of large language models (LLMs) to recognize their knowledge boundaries and express calibrated confidence, is essential for trustworthy deployment. Existing methods rely either on training-free confidence estimation (e.g., token probabilities, self-consistency) or on training-based calibration with correctness annotations. While effective, training-based calibration requires costly, large-scale labeling to achieve universal honesty alignment. To support annotation-efficient training, we introduce Elicitation-Then-Calibration (EliCal), a two-stage framework that first elicits internal confidence using inexpensive self-consistency supervision, then calibrates this confidence with a small set of correctness annotations. To support a large-scale study, we release HonestyBench, a benchmark covering ten free-form QA datasets with 560k training and 70k evaluation instances annotated with correctness and self-consistency signals. Experiments show that EliCal achieves near-optimal alignment with only 1k correctness annotations (0.18% of full supervision) and better alignment performance on unseen MMLU tasks than the calibration-only baseline, offering a scalable solution toward universal honesty alignment in LLMs.
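The self-consistency signal used in the first stage can be illustrated with a minimal sketch. It assumes one common definition, the fraction of sampled answers that match the greedy answer after simple string normalization; the abstract does not specify the exact matching rule, so the names `normalize` and `self_consistency` and the exact-match comparison are illustrative assumptions, not the authors' implementation. The last lines check the annotation fraction quoted in the abstract.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization, not from the paper)."""
    return " ".join(answer.lower().split())


def self_consistency(greedy_answer: str, sampled_answers: list[str]) -> float:
    """Fraction of sampled answers that agree with the greedy answer.

    Hypothetical helper: exact match after normalization stands in for
    whatever semantic-equivalence check a real pipeline would use.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    reference = normalize(greedy_answer)
    matches = sum(normalize(s) == reference for s in sampled_answers)
    return matches / len(sampled_answers)


# Three of the four samples agree with the greedy answer.
confidence_target = self_consistency("Paris", ["paris", " Paris ", "Lyon", "PARIS"])

# Annotation budget from the abstract: 1k labels out of 560k training instances.
annotation_percent = 100 * 1_000 / 560_000
print(confidence_target, round(annotation_percent, 2))
```

Such a score needs no correctness labels, which is why it can serve as cheap supervision before the small labeled set is used for calibration.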

Shiyu Ni, Keping Bi, Jiafeng Guo, Minghao Tang, Jingtong Wu, Zengxin Han, Xueqi Cheng • 2025

Related benchmarks

Task	Dataset	Result
Honesty Alignment	Natural Questions (NQ) In-Domain	AUROC 85.16	33
Honesty Alignment	HonestyBench In-Domain	NQ Score 85.16	13
Honesty Alignment	HonestyBench OOD	Squad Score 81.04	13
Question Answering Calibration	In-Domain Evaluation NQ, TQ, HQ, 2Wiki, Pararel	Calibration Error (NQ) 0.05	11
Question Answering Calibration	OOD Evaluation (Squad, WQ, CWQ, MSQ, PopQA)	Squad Calibration Score 10	11
Honesty Alignment	In-Domain Aggregate Average	AUROC 84.36	3
Honesty Alignment	OOD Aggregate Average	AUROC 0.8447	3
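
For readers unfamiliar with the AUROC metric in the table, the following is a minimal sketch of how it can be computed from per-question confidence scores and binary correctness labels. It uses the standard pairwise (Mann-Whitney) formulation; the function name `auroc` and the toy inputs are illustrative, and this is not the evaluation code behind the reported numbers. The function returns a value between 0 and 1, so entries such as 85.16 are presumably the same quantity expressed as a percentage.

```python
def auroc(confidences: list[float], correct: list[int]) -> float:
    """Probability that a correct answer gets higher confidence than an incorrect one.

    Ties between a correct and an incorrect answer count as half a win.
    """
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: one correct answer (confidence 0.2) ranks below an incorrect one (0.8).
score = auroc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
print(score)
```

The quadratic pairwise loop is chosen for clarity; a rank-based computation gives the same result more efficiently on large evaluation sets.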
