LUQ: Long-text Uncertainty Quantification for LLMs
About
Large Language Models (LLMs) have demonstrated remarkable capability in a variety of NLP tasks. However, LLMs are also prone to generating nonfactual content. Uncertainty Quantification (UQ) is pivotal in enhancing our understanding of a model's confidence in its generations, thereby aiding in the mitigation of nonfactual outputs. Existing research on UQ predominantly targets short text generation, typically yielding brief, word-limited responses. However, real-world applications frequently necessitate much longer responses. Our study first highlights the limitations of current UQ methods in handling long text generation. We then introduce LUQ and its two variations, a series of novel sampling-based UQ approaches specifically designed for long text. Our findings reveal that LUQ outperforms existing baseline methods in correlating with the model's factuality scores (a negative coefficient of -0.85 is observed for Gemini Pro). To further improve the factuality of LLM responses, we propose LUQ-Ensemble, a method that ensembles responses from multiple models and selects the response with the lowest uncertainty. The ensembling method greatly improves response factuality over the best standalone LLM.
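The selection step described for LUQ-Ensemble (pick the response with the lowest uncertainty across models) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Candidate` class, the `select_lowest_uncertainty` function, the model labels, and the uncertainty values are all hypothetical, and the sketch assumes an uncertainty score has already been computed for each response by some UQ method.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One model's response plus a precomputed uncertainty score (hypothetical structure)."""
    model: str          # illustrative model label, not taken from the paper
    response: str       # the long-form text produced by that model
    uncertainty: float  # assumed precomputed; lower means more confident


def select_lowest_uncertainty(candidates: list[Candidate]) -> Candidate:
    """Return the candidate whose response has the lowest uncertainty score."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    return min(candidates, key=lambda c: c.uncertainty)


# Made-up scores for illustration only.
candidates = [
    Candidate(model="model_a", response="Response from model A ...", uncertainty=0.42),
    Candidate(model="model_b", response="Response from model B ...", uncertainty=0.17),
    Candidate(model="model_c", response="Response from model C ...", uncertainty=0.31),
]

best = select_lowest_uncertainty(candidates)
print(best.model, best.uncertainty)
```

The sketch covers only the final selection; how each uncertainty score is obtained is left to the UQ method in use.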
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Uncertainty Quantification | Aggregated Experimental Datasets (XSum, SamSum, CNN, WMT19, MedQUAD, TruthfulQA, CoQA, SciQ, TriviaQA, MMLU, GSM8k) (test) | Mean Rank | 12.18 | 88 |
| Question Answering | MedQUAD | PRR | 9.6 | 66 |
| Selective Generation | TriviaQA | ROC-AUC | 84.4 | 66 |
| Selective Generation | CoQA | ROC-AUC | 69.9 | 66 |
| Question Answering | SciQ | PRR | 44.9 | 66 |
| Selective Generation | TruthfulQA | ROC-AUC | 0.663 | 66 |
| Summarization | SamSum | PRR | 0.17 | 66 |
| Selective Generation | SamSum | ROC-AUC | 65.6 | 66 |
| Selective Generation | XSum | ROC-AUC | 66.8 | 66 |
| Selective Generation | CNN | ROC-AUC | 59.1 | 66 |