QA-Calibration of Language Model Confidence Scores
About
For generative question-answering (QA) systems to be used in decision-making and other critical applications, they must provide well-calibrated confidence scores that reflect the correctness of their answers. Existing calibration methods aim to ensure that the confidence score is, *on average*, indicative of the likelihood that the answer is correct. We argue, however, that this standard (average-case) notion of calibration is difficult to interpret for decision-making in generative QA. To address this, we generalize the standard notion of average calibration and introduce QA-calibration, which ensures that calibration holds across different question-and-answer groups. We then propose discretized post-hoc calibration schemes for achieving QA-calibration. We establish distribution-free guarantees on the performance of these schemes and validate them on confidence scores returned by elicitation prompts across multiple QA benchmarks and large language models (LLMs).
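To make the distinction between average calibration and per-group calibration concrete, here is a minimal sketch of group-wise histogram binning. It is an illustration under stated assumptions, not the schemes proposed in the paper: the integer `group` labels stand in for whatever question-and-answer grouping is used, and the function names (`fit_groupwise_binning`, `apply_groupwise_binning`, `ece`) are hypothetical.

```python
import numpy as np


def _bin_index(conf, n_bins):
    # Equal-width bins on [0, 1]; only interior edges are passed to digitize
    # so that a confidence of exactly 1.0 lands in the last bin.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    return np.digitize(conf, edges[1:-1])


def fit_groupwise_binning(conf, correct, group, n_bins=5):
    """For each group, record the empirical accuracy inside each confidence bin."""
    table = {}
    for g in np.unique(group):
        mask = group == g
        idx = _bin_index(conf[mask], n_bins)
        acc = np.full(n_bins, np.nan)  # NaN marks bins with no calibration data
        for b in range(n_bins):
            in_bin = idx == b
            if in_bin.any():
                acc[b] = correct[mask][in_bin].mean()
        table[int(g)] = acc
    return table


def apply_groupwise_binning(conf, group, table, n_bins=5):
    """Replace each score with its group's bin accuracy; keep the raw score if unseen."""
    idx = _bin_index(conf, n_bins)
    out = conf.astype(float).copy()
    for i, (g, b) in enumerate(zip(group, idx)):
        acc = table.get(int(g))
        if acc is not None and not np.isnan(acc[b]):
            out[i] = acc[b]
    return out


def ece(conf, correct, n_bins=5):
    """Standard (average-case) expected calibration error with equal-width bins."""
    idx = _bin_index(conf, n_bins)
    total = 0.0
    for b in range(n_bins):
        in_bin = idx == b
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            total += in_bin.mean() * gap
    return total
```

In this sketch the calibration map is fitted separately per group, so two answers with the same raw confidence can receive different calibrated scores if they fall in different groups. In practice the table would be fitted on a held-out calibration split rather than on the data being evaluated.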
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Calibration | MMLU | -- | -- | 58 |
| Calibration | TruthfulQA | Gain | 1.984 | 32 |
| Calibration | MathQA | ECE' Gain | 0.145 | 8 |
| Calibration | OpenBookQA | ECE Gain | 0.605 | 8 |
| Question Answering | SciQ | ECE' Gain | 0.776 | 8 |
| Question Answering | TriviaQA | ECE' Gain | 2.262 | 8 |
| Summarization | FeedSum (test) | ECE (Instance) | 0.022 | 5 |
| Summarization quality evaluation | UniSumEval | ECE' | 0.99 | 3 |