Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback
About
A trustworthy real-world prediction system should produce well-calibrated confidence scores; that is, its confidence in an answer should be indicative of the likelihood that the answer is correct, enabling deferral to an expert in cases of low-confidence predictions. Recent studies have shown that unsupervised pre-training produces large language models (LMs) whose conditional probabilities are remarkably well-calibrated. However, the most widely-used LMs are fine-tuned with reinforcement learning from human feedback (RLHF-LMs), and some studies have suggested that RLHF-LMs produce conditional probabilities that are very poorly calibrated. In light of this perceived weakness, we conduct a broad evaluation of methods for extracting confidence scores from RLHF-LMs. For RLHF-LMs such as ChatGPT, GPT-4, and Claude, we find that verbalized confidences emitted as output tokens are typically better-calibrated than the model's conditional probabilities on the TriviaQA, SciQ, and TruthfulQA benchmarks, often reducing the expected calibration error by a relative 50%.
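The abstract's headline metric is expected calibration error (ECE): predictions are grouped into confidence bins, and per-bin accuracy is compared with per-bin mean confidence. Below is a minimal sketch of that computation; the bin count, helper name, and toy inputs are illustrative assumptions, not the paper's evaluation code.

```python
# Minimal sketch of expected calibration error (ECE) with equal-width
# confidence bins. The function name, 10-bin default, and example data
# are assumptions for illustration, not the paper's exact setup.

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins B of (|B|/N) * |accuracy(B) - mean_confidence(B)|."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Assign each prediction to exactly one half-open bin (lo, hi];
        # a confidence of exactly 0.0 falls into the first bin.
        idx = [i for i, c in enumerate(confidences)
               if (lo < c <= hi) or (b == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(acc - conf)
    return ece

# Toy example: verbalized confidences (e.g. "I'm 90% sure") scored
# against whether each answer was actually correct.
confs = [0.9, 0.8, 0.6, 0.95, 0.5]
right = [1, 1, 0, 1, 0]
print(round(expected_calibration_error(confs, right), 2))  # → 0.29
```

A perfectly calibrated model (confidence always matching accuracy within each bin) would score an ECE of 0; the "relative 50% reduction" claimed above refers to this quantity.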
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Medical Question Answering | MedMCQA (test) | Accuracy | 63.4 | 134 |
| Model Calibration | MACE | AUROC | 74.4 | 84 |
| Confidence Calibration | MACE (test) | AUROC | 66.9 | 84 |
| LLM Calibration | MACE | ECE | 22.8 | 60 |
| Calibration | NQ | ECE | 0.203 | 55 |
| Calibration | MMLU | Brier Score | 0.2546 | 42 |
| Calibration | TriviaQA | Brier Score | 0.2046 | 39 |
| Multiple-choice Question Answering | TruthfulQA MC1 | MC1 Accuracy | 59.2 | 33 |
| Calibration | SQuAD | ECE | 29.56 | 31 |
| Calibration | WebQ | ECE | 0.2537 | 31 |