FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization

About

Neural abstractive summarization models are prone to generate content inconsistent with the source document, i.e. unfaithful. Existing automatic metrics do not capture such mistakes effectively. We tackle the problem of evaluating faithfulness of a generated summary given its source document. We first collected human annotations of faithfulness for outputs from numerous models on two datasets. We find that current models exhibit a trade-off between abstractiveness and faithfulness: outputs with less word overlap with the source document are more likely to be unfaithful. Next, we propose an automatic question answering (QA) based metric for faithfulness, FEQA, which leverages recent advances in reading comprehension. Given question-answer pairs generated from the summary, a QA model extracts answers from the document; non-matched answers indicate unfaithful information in the summary. Among metrics based on word overlap, embedding similarity, and learned language understanding models, our QA-based metric has significantly higher correlation with human faithfulness scores, especially on highly abstractive summaries.

Esin Durmus, He He, Mona Diab • 2020
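
The scoring step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes the question-answer pairs have already been generated from the summary, and it assumes answers are compared with token-level F1 and averaged, since the abstract only says that non-matched answers indicate unfaithful content. The names `token_f1`, `faithfulness_score`, and `toy_qa_model` are hypothetical, and `toy_qa_model` is a canned lookup standing in for a real reading-comprehension model.

```python
from collections import Counter
from typing import Callable, List, Tuple


def token_f1(predicted: str, reference: str) -> float:
    """Token-level F1 between two answer strings (lowercased, whitespace-split)."""
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def faithfulness_score(
    qa_pairs: List[Tuple[str, str]],
    document: str,
    answer_from_document: Callable[[str, str], str],
) -> float:
    """Average agreement between summary answers and document answers.

    qa_pairs holds (question, answer) pairs derived from the summary.
    answer_from_document is any QA model that answers a question from the
    source document (an assumption; plug in a real model here).
    """
    if not qa_pairs:
        raise ValueError("need at least one question-answer pair")
    scores = [
        token_f1(answer_from_document(question, document), summary_answer)
        for question, summary_answer in qa_pairs
    ]
    return sum(scores) / len(scores)


def toy_qa_model(question: str, document: str) -> str:
    """Hypothetical stand-in for a QA model: a canned lookup, ignores the document."""
    canned = {
        "Who won the match?": "Alice",
        "Where was the match held?": "Paris",
    }
    return canned.get(question, "")


document = "Alice won the match, which was held in Paris on Sunday."
qa_pairs = [
    ("Who won the match?", "Alice"),           # supported by the document
    ("Where was the match held?", "London"),   # contradicts the document
]
score = faithfulness_score(qa_pairs, document, toy_qa_model)
print(score)  # 0.5
```

In this toy case one of the two summary answers is supported by the document, so the score is 0.5. A lower score suggests more of the summary's content cannot be recovered from the source.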

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Factual Consistency Evaluation | SummaC | CGS | 53.7 | 52 |
| Factual Consistency Evaluation | QAGS XSUM | Spearman Correlation | -6.5 | 39 |
| Factual Consistency Evaluation | QAGS CNNDM | Spearman Correlation | -7.2 | 38 |
| Factual Consistency Evaluation | TRUE benchmark | PAWS (AUC-ROC) | 50 | 37 |
| Factual Consistency Evaluation | SummEval | Spearman Correlation | 0.2 | 36 |
| Factual Consistency Evaluation | FRANK CNNDM | Spearman Correlation | -2.9 | 30 |
| Factual Consistency Evaluation | SamSum | Spearman Correlation | 0.00 | 30 |
| Factual Consistency Evaluation | FRANK-XSum (FRK-X) | Spearman Correlation | 1.5 | 30 |
| Factual Consistency Evaluation | SamSum (test) | Pearson Correlation Coefficient | 2.7 | 22 |
| Factual Consistency Evaluation | QAGS-XSum (test) | Pearson Correlation Coefficient | -0.73 | 22 |
Showing 10 of 25 rows
