Evaluating Factuality in Generation with Dependency-level Entailment

About

Despite significant progress in text generation models, a serious limitation is their tendency to produce text that is factually inconsistent with information in the input. Recent work has studied whether textual entailment systems can be used to identify factual errors; however, these sentence-level entailment models are trained to solve a different problem than generation filtering and they do not localize which part of a generation is non-factual. In this paper, we propose a new formulation of entailment that decomposes it at the level of dependency arcs. Rather than focusing on aggregate decisions, we instead ask whether the semantic relationship manifested by individual dependency arcs in the generated output is supported by the input. Human judgments on this task are difficult to obtain; we therefore propose a method to automatically create data based on existing entailment or paraphrase corpora. Experiments show that our dependency arc entailment model trained on this data can identify factual inconsistencies in paraphrasing and summarization better than sentence-level methods or those based on question generation, while additionally localizing the erroneous parts of the generation.

Tanya Goyal, Greg Durrett • 2020
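
The arc-level formulation described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `Arc` type, the function names, the 0.5 threshold, and the rule that an output counts as factual only when every arc is supported are all assumptions made for illustration. In particular, `toy_arc_scorer` is a hypothetical word-overlap stand-in for a trained arc entailment model, which would instead judge whether the semantic relation expressed by the arc is supported by the input.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Arc:
    """One dependency arc in the generated output: head -> child, with a relation label."""
    head: str
    child: str
    label: str


def toy_arc_scorer(source: str, arc: Arc) -> float:
    """Hypothetical stand-in for a trained arc entailment model.

    Returns 1.0 if both words of the arc occur in the source, else 0.0.
    A real model would score the semantic relation, not word overlap.
    """
    tokens = set(source.lower().split())
    both_present = arc.head.lower() in tokens and arc.child.lower() in tokens
    return 1.0 if both_present else 0.0


def unsupported_arcs(
    source: str,
    arcs: List[Arc],
    scorer: Callable[[str, Arc], float] = toy_arc_scorer,
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[Arc]:
    """Localize errors: return the arcs whose score falls below the threshold."""
    return [arc for arc in arcs if scorer(source, arc) < threshold]


def is_factual(source: str, arcs: List[Arc]) -> bool:
    """Assumed aggregation: the output is factual only if no arc is unsupported."""
    return not unsupported_arcs(source, arcs)
```

For the input "the company reported a loss on monday" and an output whose arcs are `Arc("reported", "company", "nsubj")` and `Arc("reported", "profit", "dobj")`, the sketch flags only the second arc, which shows how an arc-level decision points at the erroneous part of the generation rather than rejecting the sentence as a whole.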

Related benchmarks

Task | Dataset | Result | Rank
Abstractive Summarization | Gigaword (test) | -- | 58
Factual Consistency Evaluation | SummaC | CGS 52.4 | 52
Factual Consistency Evaluation | QAGS XSUM | Spearman Correlation 37.5 | 39
Factual Consistency Evaluation | QAGS CNNDM | Spearman Correlation 37.1 | 38
Factual Consistency Evaluation | TRUE benchmark | PAWS (AUC-ROC) 55.8 | 37
Factual Consistency Evaluation | SummEval | Spearman Correlation 36.2 | 36
Factual Consistency Evaluation | FRANK-XSum (FRK-X) | Spearman Correlation 32.1 | 30
Factual Consistency Evaluation | SamSum | Spearman Correlation 18.6 | 30
Factual Consistency Evaluation | FRANK CNNDM | Spearman Correlation 36.9 | 30
Factual Consistency Evaluation | XSumFaith (test) | Pearson Correlation Coefficient 42.5 | 22
Showing 10 of 32 rows
