
FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation

About

Evaluating the factuality of long-form text generated by large language models (LMs) is non-trivial because (1) generations often contain a mixture of supported and unsupported pieces of information, making binary judgments of quality inadequate, and (2) human evaluation is time-consuming and costly. In this paper, we introduce FACTSCORE, a new evaluation that breaks a generation into a series of atomic facts and computes the percentage of atomic facts supported by a reliable knowledge source. We conduct an extensive human evaluation to obtain FACTSCOREs of biographies of people generated by several state-of-the-art commercial LMs -- InstructGPT, ChatGPT, and the retrieval-augmented PerplexityAI -- and report new analysis demonstrating the need for such a fine-grained score (e.g., ChatGPT only achieves 58%). Since human evaluation is costly, we also introduce an automated model that estimates FACTSCORE using retrieval and a strong language model, with less than a 2% error rate. Finally, we use this automated metric to evaluate 6,500 generations from a new set of 13 recent LMs, which would have cost $26K if evaluated by humans, with various findings: GPT-4 and ChatGPT are more factual than public models, and Vicuna and Alpaca are some of the best public models. FACTSCORE is available for public use via `pip install factscore`.
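The core of the metric is simple: split a generation into atomic facts, judge each fact against a knowledge source, and report the fraction judged supported. The sketch below illustrates that computation only; the `is_supported` callback and the pre-split facts are hypothetical stand-ins for the paper's actual retrieval-plus-LM judging pipeline.

```python
# Minimal sketch of the FActScore idea: score = fraction of atomic facts
# supported by a knowledge source. The fact-splitting and support-judging
# steps are mocked here; the real system uses an LM and retrieval.

def factscore(atomic_facts, is_supported):
    """Return the fraction of atomic facts judged supported (0.0-1.0)."""
    if not atomic_facts:
        return 0.0
    supported = sum(1 for fact in atomic_facts if is_supported(fact))
    return supported / len(atomic_facts)

# Toy knowledge source containing two of the three facts below.
knowledge = {
    "Marie Curie was born in Warsaw.",
    "Marie Curie won two Nobel Prizes.",
}
facts = [
    "Marie Curie was born in Warsaw.",
    "Marie Curie won two Nobel Prizes.",
    "Marie Curie was born in 1850.",  # unsupported
]
score = factscore(facts, lambda f: f in knowledge)
print(f"{score:.2%}")  # → 66.67%
```

In the paper, the support judgment is made by retrieving passages from a reliable source (e.g., Wikipedia) and asking a strong LM whether each atomic fact is entailed; exact string membership above is only a placeholder for that step.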

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, Hannaneh Hajishirzi • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Hallucination Detection | HaluEval (test) | AUC-ROC | 65.15 | 126 |
| Reasoning | MATH 500 | Accuracy (%) | 71.6 | 59 |
| Claim Verification | PerplexityAI (test) | Verification Confidence | 75.4 | 52 |
| Hallucination Detection | SQuAD (test) | AUROC | 71.2 | 48 |
| Hallucination Detection | GSM8K (test) | AUROC (Reference) | 65.69 | 48 |
| Semantic Hallucination Detection | PAWS | AUROC | 73.46 | 36 |
| Hallucination Detection | RAGTruth Llama2-7B (test) | Accuracy | 53.33 | 21 |
| Hallucination Detection | Dolly Llama2-7B (test) | Accuracy | 53.54 | 21 |
| Hallucination Detection | RAGTruth Llama2-13B (test) | Accuracy | 45.33 | 21 |
| Hallucination Detection | Dolly Llama2-13B (test) | Accuracy | 46.46 | 21 |

Showing 10 of 29 rows
