
QA-Calibration of Language Model Confidence Scores

About

To use generative question-and-answering (QA) systems for decision-making and in any critical application, these systems need to provide well-calibrated confidence scores that reflect the correctness of their answers. Existing calibration methods aim to ensure that the confidence score is, *on average*, indicative of the likelihood that the answer is correct. We argue, however, that this standard (average-case) notion of calibration is difficult to interpret for decision-making in generative QA. To address this, we generalize the standard notion of average calibration and introduce QA-calibration, which ensures calibration holds across different question-and-answer groups. We then propose discretized posthoc calibration schemes for achieving QA-calibration. We establish distribution-free guarantees on the performance of this method and validate our method on confidence scores returned by elicitation prompts across multiple QA benchmarks and large language models (LLMs).

Putra Manggala, Atalanti Mastakouri, Elke Kirschbaum, Shiva Prasad Kasiviswanathan, Aaditya Ramdas • 2024
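The abstract's contrast between average-case calibration and calibration within question-and-answer groups can be illustrated with a small sketch. This is not the paper's method: it uses a generic binned expected calibration error (ECE), and the group labels ("easy", "hard"), the toy data, and both function names are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin."""
    bins = defaultdict(list)
    for c, y in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))
    n = len(confidences)
    ece = 0.0
    for items in bins.values():
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(y for _, y in items) / len(items)
        ece += len(items) / n * abs(accuracy - mean_conf)
    return ece


def groupwise_ece(confidences, correct, groups, n_bins=10):
    """ECE computed separately within each (hypothetical) group label."""
    by_group = defaultdict(lambda: ([], []))
    for c, y, g in zip(confidences, correct, groups):
        by_group[g][0].append(c)
        by_group[g][1].append(y)
    return {
        g: expected_calibration_error(cs, ys, n_bins)
        for g, (cs, ys) in by_group.items()
    }


# Toy data (assumed): every answer is reported with confidence 0.75.
# Group "easy" is always correct; group "hard" is correct half the time.
confidences = [0.75] * 8
correct = [1, 1, 1, 1, 1, 0, 1, 0]
groups = ["easy"] * 4 + ["hard"] * 4

overall = expected_calibration_error(confidences, correct)
per_group = groupwise_ece(confidences, correct, groups)
print(overall)    # 0.0: calibrated on average
print(per_group)  # {'easy': 0.25, 'hard': 0.25}: miscalibrated within each group
```

In this toy setting the scores look perfectly calibrated on average, yet a confidence of 0.75 is too low for one group and too high for the other, which is the kind of gap a group-wise notion of calibration is meant to expose.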

Related benchmarks

| Task | Dataset | Result | Rank |
| Calibration | MMLU | - | 58 |
| Calibration | TruthfulQA | Gain 1.984 | 32 |
| Calibration | MathQA | ECE' Gain 0.145 | 8 |
| Calibration | OpenBookQA | ECE Gain 0.605 | 8 |
| Question Answering | SciQ | ECE' Gain 0.776 | 8 |
| Question Answering | TriviaQA | ECE' Gain 2.262 | 8 |
| Summarization | FeedSum (test) | ECE (Instance) 0.022 | 5 |
| Summarization quality evaluation | UniSumEval | ECE' 0.99 | 3 |
