LM-Polygraph: Uncertainty Estimation for Language Models

About

Recent advancements in the capabilities of large language models (LLMs) have paved the way for a myriad of groundbreaking applications in various fields. However, a significant challenge arises as these models often "hallucinate", i.e., fabricate facts without providing users an apparent means to discern the veracity of their statements. Uncertainty estimation (UE) methods are one path to safer, more responsible, and more effective use of LLMs. However, to date, research on UE methods for LLMs has been focused primarily on theoretical rather than engineering contributions. In this work, we tackle this issue by introducing LM-Polygraph, a framework with implementations of a battery of state-of-the-art UE methods for LLMs in text generation tasks, with unified program interfaces in Python. Additionally, it introduces an extendable benchmark for consistent evaluation of UE techniques by researchers, and a demo web application that enriches the standard chat dialog with confidence scores, empowering end-users to discern unreliable responses. LM-Polygraph is compatible with the most recent LLMs, including BLOOMz, LLaMA-2, ChatGPT, and GPT-4, and is designed to support future releases of similarly-styled LMs.
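To make "uncertainty estimation" concrete, here is a minimal sketch of two simple white-box scores that can be computed from a model's token-level outputs: the negative log-likelihood of the generated sequence and the mean entropy of the per-token distributions. This is an illustration only and does not use LM-Polygraph's interfaces. The function names and the input format (plain Python lists of log-probabilities and probability vectors) are assumptions made for this example.

```python
import math
from typing import Sequence


def sequence_negative_log_likelihood(token_logprobs: Sequence[float]) -> float:
    """Uncertainty as the negative log-probability of the whole generated sequence.

    token_logprobs holds log p(token_i | prefix) for each generated token.
    Higher values mean the model assigned lower probability to its own output.
    """
    return -sum(token_logprobs)


def mean_token_entropy(token_distributions: Sequence[Sequence[float]]) -> float:
    """Average Shannon entropy (in nats) of the per-token predictive distributions.

    Each inner sequence is a probability vector over the vocabulary
    (hypothetical input format; it is assumed to sum to 1).
    """
    if not token_distributions:
        raise ValueError("at least one token distribution is required")
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0.0)
        for dist in token_distributions
    ]
    return sum(entropies) / len(entropies)


# Toy example: a two-token generation over a two-word vocabulary.
logprobs = [math.log(0.5), math.log(0.5)]
distributions = [[0.5, 0.5], [1.0, 0.0]]
nll = sequence_negative_log_likelihood(logprobs)
entropy = mean_token_entropy(distributions)
```

Both scores are ordered so that a larger value signals a less reliable generation, which is the convention a confidence display or a rejection rule would rely on.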

Ekaterina Fadeeva, Roman Vashurin, Akim Tsvigun, Artem Vazhentsev, Sergey Petrakov, Kirill Fedyanin, Daniil Vasilev, Elizaveta Goncharova, Alexander Panchenko, Maxim Panov, Timothy Baldwin, Artem Shelmanov • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Medical LLM Risk Triage | RETINA-SAFE Stage-1 | Unsafe Recall | 99.82 | 60 |
| Misclassification Detection | COLA | ROC-AUC | 77 | 31 |
| Question Answering | TriviaQA | ECE | 0.104 | 28 |
| Selective Prediction | WMT de 14 | Prediction Ranking Rate | 34.8 | 20 |
| Selective Prediction | WMT fr 14 | Prediction Ranking Rate | 39.1 | 20 |
| Selective Prediction | WMT de 19 | PRR | 46.5 | 20 |
| Selective Prediction | WMT ru 19 | PRR | 33.8 | 20 |
| Selective Prediction | MMLU | PRR | 75.9 | 20 |
| Selective Prediction | bAbI | PRR | 77 | 20 |
| Selective Prediction | SamSum | PRR | 32.6 | 20 |
Showing 10 of 19 rows
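
The TriviaQA row reports ECE (expected calibration error). As a rough guide to reading that number, the sketch below computes ECE with equal-width confidence bins: the weighted average gap between mean confidence and accuracy in each bin. The page does not state how the listed value was computed, so the bin count, the binning scheme, and the function name here are assumptions for illustration.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[int],
    n_bins: int = 10,
) -> float:
    """ECE with equal-width bins over [0, 1].

    confidences: model confidence for each prediction, assumed to lie in [0, 1].
    correct: 1 if the prediction was right, 0 otherwise.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    conf_sum = [0.0] * n_bins
    acc_sum = [0.0] * n_bins
    count = [0] * n_bins
    for c, y in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        acc_sum[idx] += y
        count[idx] += 1

    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue  # empty bins contribute nothing
        avg_conf = conf_sum[b] / count[b]
        accuracy = acc_sum[b] / count[b]
        ece += (count[b] / n) * abs(accuracy - avg_conf)
    return ece


# Toy example: four predictions at 0.9 confidence, three of them correct.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

In the toy example the model claims 90% confidence but is right 75% of the time, so the gap in the single occupied bin is the whole score. Lower ECE means confidence tracks accuracy more closely.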
