Annotation-Efficient Universal Honesty Alignment
About
Honesty alignment, the ability of large language models (LLMs) to recognize their knowledge boundaries and express calibrated confidence, is essential for trustworthy deployment. Existing methods rely either on training-free confidence estimation (e.g., token probabilities, self-consistency) or on training-based calibration with correctness annotations. Training-based calibration is effective, but achieving universal honesty alignment with it requires costly, large-scale labeling. To support annotation-efficient training, we introduce Elicitation-Then-Calibration (EliCal), a two-stage framework that first elicits internal confidence using inexpensive self-consistency supervision, then calibrates this confidence with a small set of correctness annotations. To support a large-scale study, we release HonestyBench, a benchmark covering ten free-form QA datasets with 560k training and 70k evaluation instances annotated with correctness and self-consistency signals. Experiments show that EliCal achieves near-optimal alignment with only 1k correctness annotations (0.18% of full supervision) and better alignment on unseen MMLU tasks than the calibration-only baseline, offering a scalable path toward universal honesty alignment in LLMs.
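To make the two signals in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that self-consistency confidence is the share of sampled answers that agree with the most frequent answer, and that alignment quality is scored with AUROC (the metric reported in the benchmark table). The function names `normalize`, `self_consistency_confidence`, and `auroc` are hypothetical, and the whitespace and case normalization is a simplification of real answer matching.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (a simplified matching rule, assumed here)."""
    return " ".join(answer.lower().split())


def self_consistency_confidence(sampled_answers):
    """Return (majority answer, fraction of samples agreeing with it).

    Assumption: agreement among sampled answers serves as the
    annotation-free confidence signal; no gold label is needed.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(normalize(a) for a in sampled_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(sampled_answers)


def auroc(confidences, correct):
    """Probability that a correct answer gets higher confidence than an incorrect one.

    Computed by direct pairwise comparison; ties count as half a win.
    """
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

Under these assumptions, the first function needs only model samples, which is what makes the elicitation signal inexpensive, while `auroc` requires correctness labels and therefore corresponds to the annotated part of the data.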
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Honesty Alignment | Natural Questions (NQ) In-Domain | AUROC 85.16 | 33 |
| Honesty Alignment | HonestyBench In-Domain | NQ Score 85.16 | 13 |
| Honesty Alignment | HonestyBench OOD | Squad Score 81.04 | 13 |
| Question Answering Calibration | In-Domain Evaluation NQ, TQ, HQ, 2Wiki, Pararel | Calibration Error (NQ) 0.05 | 11 |
| Question Answering Calibration | OOD Evaluation (Squad, WQ, CWQ, MSQ, PopQA) | Squad Calibration Score 10 | 11 |
| Honesty Alignment | In-Domain Aggregate Average | AUROC 84.36 | 3 |
| Honesty Alignment | OOD Aggregate Average | AUROC 0.8447 | 3 |