Rethinking Uncertainty Estimation in LLMs: A Principled Single-Sequence Measure

About

Large Language Models (LLMs) are increasingly employed in real-world applications, driving the need to evaluate the trustworthiness of their generated text. To this end, reliable uncertainty estimation is essential. Leading uncertainty estimation methods generate and analyze multiple output sequences, which is computationally expensive and impractical at scale. In this work, we inspect the theoretical foundations of these methods and explore new directions to enhance computational efficiency. Building on the framework of proper scoring rules, we find that the negative log-likelihood of the most likely output sequence constitutes a theoretically principled uncertainty measure. To approximate this alternative measure, we propose G-NLL, obtained using a single output sequence from greedy decoding. This approach streamlines uncertainty estimation while preserving theoretical rigor. Empirical results demonstrate that G-NLL achieves state-of-the-art performance across various scenarios. Our work lays the theoretical foundation for efficient and reliable uncertainty estimation in natural language generation, challenging the necessity of the prevalent methods that are more complex and resource-intensive.
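As a rough illustration of the measure described above, the sketch below decodes greedily and sums the negative log-probabilities of the chosen tokens. It is a minimal sketch under stated assumptions, not the authors' implementation: `next_token_logits`, `toy_model`, `eos_id`, and `max_new_tokens` are hypothetical names introduced here, and the "model" is a stand-in function that returns logits for a given token prefix.

```python
import math
from typing import Callable, List, Sequence, Tuple


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def greedy_nll(
    next_token_logits: Callable[[List[int]], Sequence[float]],  # hypothetical model interface
    prompt: List[int],
    eos_id: int,
    max_new_tokens: int = 32,
) -> Tuple[List[int], float]:
    """Greedy-decode one sequence and return it with its negative log-likelihood."""
    tokens = list(prompt)
    generated: List[int] = []
    nll = 0.0
    for _ in range(max_new_tokens):
        log_probs = log_softmax(next_token_logits(tokens))
        # Greedy step: pick the highest-probability next token.
        best = max(range(len(log_probs)), key=log_probs.__getitem__)
        nll -= log_probs[best]  # accumulate -log p(y_t | x, y_<t)
        generated.append(best)
        tokens.append(best)
        if best == eos_id:
            break
    return generated, nll


def toy_model(prefix: List[int]) -> Sequence[float]:
    """Hypothetical 3-token 'model' (token 2 is EOS), for illustration only."""
    if len(prefix) < 3:
        return [2.0, 0.0, 0.0]
    return [0.0, 0.0, 5.0]


sequence, score = greedy_nll(toy_model, prompt=[0], eos_id=2)
print(sequence, score)
```

A higher score indicates lower model confidence in its greedy output. Note that greedy decoding only approximates the most likely sequence; it is not guaranteed to find it.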

Lukas Aichberger, Kajetan Schweighofer, Sepp Hochreiter • 2024

Related benchmarks

Task	Dataset	Result
Correctness Prediction	TriviaQA	AUROC 0.8591	113
Summarization	Summ.	Mean PRR 0.507	109
Machine Translation	MT	Mean PRR 44.7	109
Question Answering	QA	Mean PRR 36.4	109
Hallucination Detection	QA	ROC-AUC 74.6	64
Hallucination Detection	Summ.	ROC-AUC 73.5	64
Hallucination Detection	MT	ROC-AUC 68.8	64
Predicting answer correctness	TruthfulQA	AUROC 0.6367	48
Generation correctness prediction	SciQ	AUROC 76.83	42
Generation correctness prediction	TruthfulQA (test)	AURC 59.08	42

Showing 10 of 20 rows
