Quantifying Uncertainty in Answers from any Language Model and Enhancing their Trustworthiness

About

We introduce BSDetector, a method for detecting bad and speculative answers from a pretrained Large Language Model by estimating a numeric confidence score for any output it generates. Our uncertainty quantification technique works for any LLM accessible only via a black-box API, whose training data remains unknown. By expending a bit of extra computation, users of any LLM API can now get the same response as they would ordinarily, as well as a confidence estimate that cautions when not to trust this response. Experiments on both closed and open-form Question-Answer benchmarks reveal that BSDetector more accurately identifies incorrect LLM responses than alternative uncertainty estimation procedures (for both GPT-3 and ChatGPT). By sampling multiple responses from the LLM and considering the one with the highest confidence score, we can additionally obtain more accurate responses from the same LLM, without any extra training steps. In applications involving automated evaluation with LLMs, accounting for our confidence scores leads to more reliable evaluation in both human-in-the-loop and fully-automated settings (across both GPT-3.5 and GPT-4).
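The "sample multiple responses and keep the most confident one" step can be sketched as follows. The abstract does not say how BSDetector computes its score, so this sketch substitutes a simple stand-in: the fraction of sampled answers that agree with a candidate. The names `agreement_confidence` and `select_most_confident` are hypothetical, and the sampled answers are hard-coded rather than drawn from a real LLM API.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple


def _normalize(text: str) -> str:
    """Light normalization so trivially different strings compare equal."""
    return text.strip().lower()


def agreement_confidence(answer: str, samples: Sequence[str]) -> float:
    """Stand-in confidence score (an assumption, not BSDetector's scoring):
    the fraction of sampled answers that match `answer` after normalization."""
    if not samples:
        return 0.0
    counts = Counter(_normalize(s) for s in samples)
    return counts[_normalize(answer)] / len(samples)


def select_most_confident(
    samples: Sequence[str],
    score_fn: Callable[[str, Sequence[str]], float] = agreement_confidence,
) -> Tuple[str, float]:
    """Score every sampled answer and return the highest-scoring one
    together with its score. Ties go to the earliest sample."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    scored = [(score_fn(s, samples), s) for s in samples]
    best_score, best_answer = max(scored, key=lambda pair: pair[0])
    return best_answer, best_score


# Hypothetical answers sampled from the same LLM for one question.
sampled_answers = ["42", "42 ", "41", "42", "40"]
best_answer, best_score = select_most_confident(sampled_answers)
print(best_answer, best_score)
```

Any scoring function with the same signature can be passed as `score_fn`, so a black-box confidence estimator could be swapped in without changing the selection logic. The same score can also be compared against a threshold to flag answers that should not be trusted.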

Jiuhai Chen, Jonas Mueller • 2023

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	GSM8K	Accuracy 69.44	1398
Mathematical Reasoning	SVAMP	Accuracy 82	403
Commonsense Reasoning	CSQA	Accuracy 73.22	366
Question Answering	TriviaQA	Accuracy 76	238
Uncertainty Estimation	TriviaQA	AUROC 82.8	111
Uncertainty Estimation	GSM8K	AUROC 0.951	41
Uncertainty Estimation	SVAMP	--	8
Uncertainty Estimation	CSQA	AUROC 0.769	7
Confidence Estimation	MediTOD	AUROC 62.8	7
Confidence Estimation	DDXPlus	AUROC 0.652	7
