
Rethinking Uncertainty Estimation in LLMs: A Principled Single-Sequence Measure

About

Large Language Models (LLMs) are increasingly employed in real-world applications, driving the need to evaluate the trustworthiness of their generated text. To this end, reliable uncertainty estimation is essential. Leading uncertainty estimation methods generate and analyze multiple output sequences, which is computationally expensive and impractical at scale. In this work, we inspect the theoretical foundations of these methods and explore new directions to enhance computational efficiency. Building on the framework of proper scoring rules, we find that the negative log-likelihood of the most likely output sequence constitutes a theoretically principled uncertainty measure. To approximate this alternative measure, we propose G-NLL, obtained using a single output sequence from greedy decoding. This approach streamlines uncertainty estimation while preserving theoretical rigor. Empirical results demonstrate that G-NLL achieves state-of-the-art performance across various scenarios. Our work lays the theoretical foundation for efficient and reliable uncertainty estimation in natural language generation, challenging the necessity of the prevalent methods that are more complex and resource-intensive.
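As a rough illustration of the measure described above, the following sketch decodes greedily and accumulates the negative log-likelihood of the chosen tokens. The `next_token_probs` interface, the `<eos>` marker, and the toy model are hypothetical stand-ins for a real language model, not the authors' implementation; greedy decoding is used here only as an approximation of the most likely output sequence, as the abstract states.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: maps the tokens seen so far to a next-token distribution.
NextTokenFn = Callable[[List[str]], Dict[str, float]]


def greedy_nll(
    next_token_probs: NextTokenFn,
    prompt: List[str],
    eos: str = "<eos>",
    max_new_tokens: int = 32,
) -> Tuple[List[str], float]:
    """Greedy-decode one sequence and return it with its negative log-likelihood."""
    context = list(prompt)
    generated: List[str] = []
    nll = 0.0
    for _ in range(max_new_tokens):
        probs = next_token_probs(context)
        # Greedy step: pick the single most probable next token.
        token, p = max(probs.items(), key=lambda kv: kv[1])
        nll -= math.log(p)  # accumulate -log p(token | context)
        generated.append(token)
        context.append(token)
        if token == eos:
            break
    return generated, nll


def toy_model(context: List[str]) -> Dict[str, float]:
    """Toy stand-in for an LLM (assumed): fixed distributions per decoding step."""
    steps = [
        {"Paris": 0.9, "Lyon": 0.1},
        {"<eos>": 0.8, "France": 0.2},
    ]
    # The prompt below has exactly one token, so step = len(context) - 1.
    return steps[len(context) - 1]


answer, g_nll = greedy_nll(toy_model, ["capital_of_france?"])
print(answer, round(g_nll, 4))
```

A higher value indicates that the model assigns low probability even to its greedily decoded answer, which is read as higher uncertainty. Only one forward decoding pass is needed, in contrast to methods that sample and compare multiple sequences.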

Lukas Aichberger, Kajetan Schweighofer, Sepp Hochreiter • 2024

Related benchmarks

Task | Dataset | Result | Rank
Correctness Prediction | TriviaQA | AUROC 0.8591 | 113
Predicting answer correctness | TruthfulQA | AUROC 0.6367 | 48
Generation correctness prediction | SciQ | AUROC 76.83 | 42
Generation correctness prediction | TruthfulQA (test) | AURC 59.08 | 42
Generation correctness prediction | TriviaQA (test) | AURC 30.54 | 42
Generation correctness prediction | SciQ (test) | AURC 23.03 | 42
Rejection Accuracy Evaluation | Average across all datasets (test) | G-NLL 0.612 | 31
Uncertainty Estimation | TriviaQA, SVAMP, and NQ Average | AUROC 0.843 | 23
Mathematical Reasoning | Arithematics | -- | 4
Question Answering | GPQA | -- | 4
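
Several of the results above are reported as AUROC for predicting answer correctness from an uncertainty score. The sketch below shows one common way to compute that quantity: the fraction of (incorrect, correct) answer pairs in which the incorrect answer received the higher uncertainty, with ties counted as one half. The scores and labels are made-up toy values, and the exact evaluation protocol of each benchmark is not given here, so this is an assumption about the general metric rather than a reproduction of any listed number.

```python
from typing import Sequence


def auroc(uncertainty: Sequence[float], is_correct: Sequence[bool]) -> float:
    """AUROC of an uncertainty score used to flag incorrect answers.

    Equals the probability that a randomly chosen incorrect answer has
    higher uncertainty than a randomly chosen correct one (ties count 0.5).
    """
    incorrect = [u for u, c in zip(uncertainty, is_correct) if not c]
    correct = [u for u, c in zip(uncertainty, is_correct) if c]
    if not incorrect or not correct:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for u_bad in incorrect:
        for u_good in correct:
            if u_bad > u_good:
                wins += 1.0
            elif u_bad == u_good:
                wins += 0.5
    return wins / (len(incorrect) * len(correct))


# Toy data (illustrative only): uncertainty scores and correctness labels.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [True, True, False, False]
print(auroc(scores, labels))
```

A value of 0.5 corresponds to an uninformative score and 1.0 to a score that ranks every incorrect answer above every correct one.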
