
Annotation-Efficient Universal Honesty Alignment

About

Honesty alignment, the ability of large language models (LLMs) to recognize their knowledge boundaries and express calibrated confidence, is essential for trustworthy deployment. Existing methods rely either on training-free confidence estimation (e.g., token probabilities, self-consistency) or on training-based calibration with correctness annotations. While effective, achieving universal honesty alignment with training-based calibration requires costly, large-scale labeling. To support annotation-efficient training, we introduce Elicitation-Then-Calibration (EliCal), a two-stage framework that first elicits internal confidence using inexpensive self-consistency supervision, then calibrates this confidence with a small set of correctness annotations. To support a large-scale study, we release HonestyBench, a benchmark covering ten free-form QA datasets with 560k training and 70k evaluation instances annotated with correctness and self-consistency signals. Experiments show that EliCal achieves near-optimal alignment with only 1k correctness annotations (0.18% of full supervision) and better alignment performance on unseen MMLU tasks than the calibration-only baseline, offering a scalable solution toward universal honesty alignment in LLMs.
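The two signals named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the exact-match answer normalization, and the majority-agreement definition of self-consistency are assumptions made for illustration only. The sketch computes a self-consistency confidence from sampled answers, scores confidences against correctness labels with AUROC (the metric reported in the benchmark table), and checks the 0.18% annotation ratio.

```python
from collections import Counter


def self_consistency_confidence(samples):
    """Fraction of sampled answers that agree with the most common answer.

    Assumption: answers are compared by case-insensitive exact match.
    """
    normalized = [s.strip().lower() for s in samples]
    _, top_count = Counter(normalized).most_common(1)[0]
    return top_count / len(normalized)


def auroc(confidences, correct):
    """Probability that a correct answer receives a higher confidence
    than an incorrect one; ties count as 0.5."""
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect item")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical sampled answers for one question.
conf = self_consistency_confidence(["Paris", "paris", "Lyon", "Paris "])

# Hypothetical confidences and correctness labels for four questions.
score = auroc([0.9, 0.4, 0.6, 0.3], [True, True, False, False])

# Share of the 560k training instances covered by 1k annotations.
annotation_share = 1_000 / 560_000 * 100

print(conf, score, round(annotation_share, 2))
```

With these toy inputs, three of the four samples agree, so the confidence is 0.75; three of the four correct/incorrect pairs are ordered correctly, so the AUROC is 0.75; and 1k out of 560k rounds to the 0.18% quoted above.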

Shiyu Ni, Keping Bi, Jiafeng Guo, Minghao Tang, Jingtong Wu, Zengxin Han, Xueqi Cheng • 2025

Related benchmarks

Task | Dataset | Result | Rank
Honesty Alignment | Natural Questions (NQ) In-Domain | AUROC: 85.16 | 33
Honesty Alignment | HonestyBench In-Domain | NQ Score: 85.16 | 13
Honesty Alignment | HonestyBench OOD | Squad Score: 81.04 | 13
Question Answering Calibration | In-Domain Evaluation (NQ, TQ, HQ, 2Wiki, Pararel) | Calibration Error (NQ): 0.05 | 11
Question Answering Calibration | OOD Evaluation (Squad, WQ, CWQ, MSQ, PopQA) | Squad Calibration Score: 10 | 11
Honesty Alignment | In-Domain Aggregate Average | AUROC: 84.36 | 3
Honesty Alignment | OOD Aggregate Average | AUROC: 0.8447 | 3