
Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback

About

A trustworthy real-world prediction system should produce well-calibrated confidence scores; that is, its confidence in an answer should be indicative of the likelihood that the answer is correct, enabling deferral to an expert in cases of low-confidence predictions. Recent studies have shown that unsupervised pre-training produces large language models (LMs) whose conditional probabilities are remarkably well-calibrated. However, the most widely-used LMs are fine-tuned with reinforcement learning from human feedback (RLHF-LMs), and some studies have suggested that RLHF-LMs produce conditional probabilities that are very poorly calibrated. In light of this perceived weakness, we conduct a broad evaluation of methods for extracting confidence scores from RLHF-LMs. For RLHF-LMs such as ChatGPT, GPT-4, and Claude, we find that verbalized confidences emitted as output tokens are typically better-calibrated than the model's conditional probabilities on the TriviaQA, SciQ, and TruthfulQA benchmarks, often reducing the expected calibration error by a relative 50%.
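The headline metric here is expected calibration error (ECE). For reference, below is a minimal sketch of the standard equal-width-binned ECE in NumPy, with a toy comparison of verbalized confidences against overconfident token probabilities. The 10-bin scheme and the toy numbers are illustrative assumptions, not values or code from the paper.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: per-bin |accuracy - mean confidence|,
    weighted by the fraction of samples falling in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        # Include the left edge in the first bin so confidence 0.0 is counted.
        if i == 0:
            in_bin = (confidences >= lo) & (confidences <= hi)
        else:
            in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# Toy comparison: verbalized confidences vs. raw token probabilities
# for the same five answers (1 = answer was correct, 0 = incorrect).
hits = np.array([1, 1, 0, 1, 0])
verbalized = np.array([0.90, 0.80, 0.40, 0.95, 0.20])
token_prob = np.array([0.99, 0.97, 0.96, 0.99, 0.94])  # overconfident

print(f"ECE (verbalized): {expected_calibration_error(verbalized, hits):.3f}")
print(f"ECE (token prob): {expected_calibration_error(token_prob, hits):.3f}")
```

A perfectly calibrated predictor has ECE 0; the paper's claim is that verbalized confidences roughly halve the ECE relative to the RLHF-LM's conditional probabilities.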

Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, Christopher D. Manning • 2023

Related benchmarks

Task                               | Dataset        | Metric       | Result | Rank
Medical Question Answering        | MedMCQA (test) | Accuracy     | 63.4   | 134
Model Calibration                 | MACE           | AUROC        | 74.4   | 84
Confidence Calibration            | MACE (test)    | AUROC        | 66.9   | 84
LLM Calibration                   | MACE           | ECE          | 22.8   | 60
Calibration                       | NQ             | ECE          | 0.203  | 55
Calibration                       | MMLU           | Brier Score  | 0.2546 | 42
Calibration                       | TriviaQA       | Brier Score  | 0.2046 | 39
Multiple-choice Question Answering | TruthfulQA MC1 | MC1 Accuracy | 59.2  | 33
Calibration                       | SQuAD          | ECE          | 29.56  | 31
Calibration                       | WebQ           | ECE          | 0.2537 | 31
Showing 10 of 46 benchmark rows.
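The table mixes three calibration metrics: ECE (sketched above), the Brier score (mean squared error between stated confidence and the binary correctness outcome; lower is better), and AUROC (how well confidence ranks correct answers above incorrect ones; higher is better). A minimal sketch of the latter two follows, using scikit-learn's roc_auc_score for AUROC; the toy inputs are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def brier_score(confidences, correct):
    """Mean squared error between stated confidence and the 0/1 outcome."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    return float(np.mean((confidences - correct) ** 2))

conf = [0.9, 0.8, 0.3, 0.7, 0.2]  # hypothetical confidence scores
hits = [1, 1, 0, 0, 0]            # whether each answer was correct

print(f"Brier: {brier_score(conf, hits):.3f}")   # lower is better
print(f"AUROC: {roc_auc_score(hits, conf):.3f}")  # higher is better
```

Note that ECE and Brier score measure absolute calibration of the confidence values, while AUROC only measures their ranking quality, so a model can score well on one and poorly on another.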
