LUQ: Long-text Uncertainty Quantification for LLMs

About

Large Language Models (LLMs) have demonstrated remarkable capability across a variety of NLP tasks. However, LLMs are also prone to generating nonfactual content. Uncertainty Quantification (UQ) is pivotal in enhancing our understanding of a model's confidence in its generation, thereby aiding in the mitigation of nonfactual outputs. Existing research on UQ predominantly targets short text generation, typically yielding brief, word-limited responses. However, real-world applications frequently necessitate much longer responses. Our study first highlights the limitations of current UQ methods in handling long text generation. We then introduce LUQ and its two variations, a series of novel sampling-based UQ approaches specifically designed for long text. Our findings reveal that LUQ outperforms existing baseline methods in correlating with the model's factuality scores (a negative correlation coefficient of -0.85 is observed for Gemini Pro). To further improve the factuality of LLM responses, we propose LUQ-Ensemble, a method that ensembles responses from multiple models and selects the response with the lowest uncertainty. The ensembling method greatly improves response factuality over the best standalone LLM.
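The selection step described in the abstract (score each model's response by a sampling-based uncertainty, then keep the response with the lowest score) can be illustrated with a minimal sketch. This is not the authors' implementation. The token-overlap similarity `jaccard`, the function names, and the choice of the first sample as each model's primary response are all assumptions made for illustration, and the abstract does not specify how LUQ scores consistency between samples.

```python
from typing import Callable, Dict, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Stand-in similarity: token-set overlap. An assumption for this
    sketch, NOT the scorer used by LUQ."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def sampling_uncertainty(
    samples: List[str], sim: Callable[[str, str], float] = jaccard
) -> float:
    """Return 1 minus the mean pairwise similarity of sampled responses.

    Higher values mean the samples disagree more with each other.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two sampled responses")
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += sim(samples[i], samples[j])
            pairs += 1
    return 1.0 - total / pairs


def select_lowest_uncertainty(
    samples_by_model: Dict[str, List[str]],
) -> Tuple[str, str, float]:
    """Pick the model whose samples are most self-consistent.

    Assumes (hypothetically) that the first sample of each model is
    its primary response.
    """
    scores = {m: sampling_uncertainty(s) for m, s in samples_by_model.items()}
    best = min(scores, key=scores.get)
    return best, samples_by_model[best][0], scores[best]


if __name__ == "__main__":
    demo = {
        "model_a": ["Paris is the capital of France."] * 3,
        "model_b": [
            "Paris is the capital of France.",
            "Lyon is the capital of France.",
            "The capital is Marseille.",
        ],
    }
    print(select_lowest_uncertainty(demo))
```

In practice the similarity function would be swapped for a stronger consistency scorer. The selection logic stays the same: lower disagreement among a model's own samples is treated as lower uncertainty.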

Caiqi Zhang, Fangyu Liu, Marco Basaldella, Nigel Collier • 2024

Related benchmarks

Task	Dataset	Metric	Result	
Uncertainty Quantification	Aggregated Experimental Datasets (XSum, SamSum, CNN, WMT19, MedQUAD, TruthfulQA, CoQA, SciQ, TriviaQA, MMLU, GSM8k) (test)	Mean Rank	12.18	88
Question Answering	MedQUAD	PRR	9.6	66
Selective Generation	TriviaQA	ROC-AUC	84.4	66
Selective Generation	CoQA	ROC-AUC	69.9	66
Question Answering	SciQ	PRR	44.9	66
Selective Generation	TruthfulQA	ROC-AUC	0.663	66
Summarization	SamSum	PRR	0.17	66
Selective Generation	SamSum	ROC-AUC	65.6	66
Selective Generation	XSum	ROC-AUC	66.8	66
Selective Generation	CNN	ROC-AUC	59.1	66

Showing 10 of 36 rows
