
FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction

About

Recent advancements in text summarization, particularly with the advent of Large Language Models (LLMs), have shown remarkable performance. However, a notable challenge persists as a substantial number of automatically-generated summaries exhibit factual inconsistencies, such as hallucinations. In response to this issue, various approaches for the evaluation of consistency for summarization have emerged. Yet, these newly-introduced metrics face several limitations, including lack of interpretability, focus on short document summaries (e.g., news articles), and computational impracticality, especially for LLM-based metrics. To address these shortcomings, we propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE), a more interpretable and efficient factuality-oriented metric. FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary. Our metric sets a new state of the art on AGGREFACT, the de-facto benchmark for factuality evaluation. Moreover, we extend our evaluation to a more challenging setting by conducting a human annotation process of long-form summarization. In the hope of fostering research in summarization factuality evaluation, we release the code of our metric and our factuality annotations of long-form summarization at https://github.com/Babelscape/FENICE.
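The claim-to-source alignment described in the abstract can be illustrated with a minimal sketch. It is not the released FENICE implementation. The token-overlap scorer is a toy stand-in for an NLI model, and the aggregation (best-supporting passage per claim, mean over claims) is an assumption made for illustration. All function names here are hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical scorer type: (premise, hypothesis) -> support score in [0, 1].
Scorer = Callable[[str, str], float]


def toy_overlap_scorer(premise: str, hypothesis: str) -> float:
    """Toy stand-in for an NLI model: share of hypothesis tokens found in the premise."""
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = hypothesis.lower().split()
    if not hypothesis_tokens:
        return 0.0
    hits = sum(1 for token in hypothesis_tokens if token in premise_tokens)
    return hits / len(hypothesis_tokens)


def align_claims(
    passages: List[str], claims: List[str], scorer: Scorer
) -> List[Tuple[str, int, float]]:
    """For each claim, return (claim, index of best-supporting passage, score)."""
    if not passages:
        raise ValueError("at least one source passage is required")
    alignments = []
    for claim in claims:
        scores = [scorer(passage, claim) for passage in passages]
        best = max(range(len(passages)), key=scores.__getitem__)
        alignments.append((claim, best, scores[best]))
    return alignments


def summary_score(alignments: List[Tuple[str, int, float]]) -> float:
    """Assumed aggregation: mean of per-claim best scores (0.0 if there are no claims)."""
    if not alignments:
        return 0.0
    return sum(score for _, _, score in alignments) / len(alignments)


passages = ["the cat sat on the mat", "it rained all day in rome"]
claims = ["the cat sat", "it snowed in rome"]  # claims extracted from a summary
alignments = align_claims(passages, claims, toy_overlap_scorer)
for claim, passage_index, score in alignments:
    print(f"{claim!r} -> passage {passage_index} (score {score:.2f})")
print(f"summary score: {summary_score(alignments):.3f}")
```

Because every claim is paired with the passage that best supports it, a low summary score can be traced back to the specific claims that lack support, which is the kind of interpretability the abstract refers to.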

Alessandro Scirè, Karim Ghonim, Roberto Navigli • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Factuality Evaluation | AggreFact-CNN (OLD) | Balanced Accuracy | 82.1 | 15 |
| Factuality Evaluation | AggreFact-CNN (FTS) | Balanced Accuracy | 68.2 | 15 |
| Factuality Evaluation | AggreFact-XSum FTS | Balanced Accuracy | 73.9 | 15 |
| Factuality Evaluation | AggreFact CNN (EXF) | Balanced Accuracy | 68.8 | 15 |
| Factuality Evaluation | AggreFact (FTSOTA) | Balanced Accuracy (CNN-FTS) | 70.5 | 14 |
| Factuality Evaluation | AggreFact-XSum (EXF) | Balanced Accuracy | 0.735 | 14 |
| Factuality Evaluation | AggreFact-XSum (OLD) | Balanced Accuracy | 69.9 | 14 |
| Factuality Evaluation | Long-form summarization factuality dataset (test) | Balanced Accuracy | 66.2 | 5 |
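The results above are reported as balanced accuracy, which is the mean of per-class recall. The sketch below shows that computation on made-up labels. It is not the benchmark's evaluation script, and the 0/1 label encoding (1 for factually consistent, 0 for inconsistent) is an assumption for illustration.

```python
from typing import List


def balanced_accuracy(labels: List[int], predictions: List[int]) -> float:
    """Mean of per-class recall over the classes present in `labels`."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    recalls = []
    for cls in sorted(set(labels)):
        indices = [i for i, label in enumerate(labels) if label == cls]
        correct = sum(1 for i in indices if predictions[i] == cls)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)


# Made-up example: 1 = factually consistent, 0 = inconsistent (assumed encoding).
labels = [1, 1, 1, 1, 0, 0]
predictions = [1, 1, 1, 0, 0, 1]
score = balanced_accuracy(labels, predictions)
print(f"balanced accuracy: {score:.3f}")
```

In this example the recall is 0.75 on the consistent class and 0.5 on the inconsistent class, so balanced accuracy is 0.625, lower than the plain accuracy of 4/6. The measure weights both classes equally regardless of how many examples each has.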
