Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback

About

A trustworthy real-world prediction system should produce well-calibrated confidence scores; that is, its confidence in an answer should be indicative of the likelihood that the answer is correct, enabling deferral to an expert in cases of low-confidence predictions. Recent studies have shown that unsupervised pre-training produces large language models (LMs) whose conditional probabilities are remarkably well-calibrated. However, the most widely-used LMs are fine-tuned with reinforcement learning from human feedback (RLHF-LMs), and some studies have suggested that RLHF-LMs produce conditional probabilities that are very poorly calibrated. In light of this perceived weakness, we conduct a broad evaluation of methods for extracting confidence scores from RLHF-LMs. For RLHF-LMs such as ChatGPT, GPT-4, and Claude, we find that verbalized confidences emitted as output tokens are typically better-calibrated than the model's conditional probabilities on the TriviaQA, SciQ, and TruthfulQA benchmarks, often reducing the expected calibration error by a relative 50%.

Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, Christopher D. Manning • 2023
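
The abstract's headline metric, expected calibration error (ECE), can be illustrated with a minimal sketch. It assumes the common equal-width binned definition: predictions are grouped by stated confidence, and ECE is the sample-weighted average gap between each bin's accuracy and its mean confidence. The function name, the default of 10 bins, and the toy inputs are illustrative assumptions, not details taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted mean of |accuracy - mean confidence| per bin.

    confidences: floats in [0, 1], e.g. verbalized confidences parsed from model output.
    correct: booleans, whether each corresponding answer was right.
    n_bins: number of equal-width bins (an assumed default, not from the paper).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Bin k covers [k / n_bins, (k + 1) / n_bins); a confidence of 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, bool(ok)))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # Toy data: the model says "90% sure" four times and is right three times.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

A lower value means stated confidence tracks observed accuracy more closely, so a relative 50% reduction corresponds to halving this number.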

Related benchmarks

Task	Dataset	Result
Hallucination Detection	TriviaQA	--	621
Medical Question Answering	MedMCQA (test)	Accuracy 63.4	134
Model Calibration	MACE	AUROC 74.4	84
Confidence calibration	MACE (test)	AUROC 66.9	84
Hallucination Detection	MMLU	AUPRC 67.61	62
Online Shopping	Webshop	--	61
Code Correctness Prediction	LiveCodeBench Python	Brier Score 0.067	60
Predicting code correctness	LiveCodeBench Python	ECE 0.06	60
Code Correctness Prediction	MultiPL-E Java	ECE 0.22	60
Code Correctness Prediction	MultiPL-E Java	Brier Score 0.282	60

Showing 10 of 101 rows
