
Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models

About

Large language models (LLMs) specializing in natural language generation (NLG) have recently started exhibiting promising capabilities across a variety of domains. However, gauging the trustworthiness of responses generated by LLMs remains an open challenge, with limited research on uncertainty quantification (UQ) for NLG. Furthermore, existing literature typically assumes white-box access to language models, which is becoming unrealistic either due to the closed-source nature of the latest LLMs or computational constraints. In this work, we investigate UQ in NLG for *black-box* LLMs. We first differentiate *uncertainty* from *confidence*: the former refers to the "dispersion" of the potential predictions for a fixed input, while the latter refers to the confidence in a particular prediction/generation. We then propose and compare several confidence/uncertainty measures, applying them to *selective NLG*, where unreliable results can either be discarded or flagged for further assessment. Experiments were carried out with several popular LLMs on question-answering datasets (for evaluation purposes). Results reveal that a simple measure of semantic dispersion can be a reliable predictor of the quality of LLM responses, providing valuable insights for practitioners on uncertainty management when adopting LLMs. The code to replicate our experiments is available at https://github.com/zlin7/UQ-NLG.
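The core idea behind the dispersion-based measures can be illustrated with a minimal sketch: sample several generations for the same prompt, score their pairwise semantic agreement, and treat disagreement as uncertainty. Note that this is an illustrative toy, not the paper's implementation: token-level Jaccard overlap stands in for the semantic similarity model a real system would use, and the `selective_answer` helper and its threshold are hypothetical.

```python
from itertools import combinations


def jaccard_similarity(a: str, b: str) -> float:
    """Token-overlap similarity; a cheap stand-in for a proper
    semantic similarity model (e.g., NLI-based)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def semantic_dispersion(generations: list[str]) -> float:
    """Uncertainty as mean pairwise dissimilarity among generations
    sampled for one prompt: 0.0 = all agree, 1.0 = all differ."""
    if len(generations) < 2:
        return 0.0
    pairs = list(combinations(generations, 2))
    return sum(1 - jaccard_similarity(a, b) for a, b in pairs) / len(pairs)


def selective_answer(generations: list[str], threshold: float = 0.5):
    """Selective NLG: return an answer only when dispersion is low;
    otherwise abstain (None) so the result can be routed for review."""
    u = semantic_dispersion(generations)
    return (generations[0], u) if u <= threshold else (None, u)
```

For example, three identical samples give dispersion 0.0 (answer returned), whereas three mutually disjoint answers give dispersion 1.0 (the system abstains). In a real pipeline, the threshold would be tuned on held-out data against a quality metric such as AUROC.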

Zhen Lin, Shubhendu Trivedi, Jimeng Sun • 2023

Related benchmarks

Task                        Dataset                                       Result            Rank
Hallucination Detection     TriviaQA                                      AUROC 0.7102      265
Hallucination Detection     TriviaQA (test)                               AUC-ROC 71.02     169
Uncertainty Quantification  Average of 6 datasets                         PRR 43.7          120
Hallucination Detection     HotpotQA                                      AUROC 0.55        118
Hallucination Detection     RAGTruth (test)                               AUROC 0.6958      83
Question Answering          5 QA tasks                                    Accuracy 54.02    78
Hallucination Detection     MATH                                          Mean AUROC 70     72
Uncertainty Quantification  PopQA, 500 randomly sampled queries (test)    AUROC 0.8198      70
Uncertainty Quantification  Musique, 500 randomly sampled queries (test)  AUROC 0.7255      70
Uncertainty Quantification  HotpotQA, 500 randomly sampled queries (test) AUROC 69.95       70

(Showing 10 of 37 rows.)
