
MENLI: Robust Evaluation Metrics from Natural Language Inference

About

Recently proposed BERT-based evaluation metrics for text generation perform well on standard benchmarks but are vulnerable to adversarial attacks, e.g., relating to information correctness. We argue that this stems (in part) from the fact that they are models of semantic similarity. In contrast, we develop evaluation metrics based on Natural Language Inference (NLI), which we deem a more appropriate modeling. We design a preference-based adversarial attack framework and show that our NLI-based metrics are much more robust to the attacks than the recent BERT-based metrics. On standard benchmarks, our NLI-based metrics outperform existing summarization metrics, but perform below SOTA MT metrics. However, when combining existing metrics with our NLI metrics, we obtain both higher adversarial robustness (15%-30%) and higher quality metrics as measured on standard benchmarks (+5% to 30%).

Yanran Chen, Steffen Eger • 2022
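The core idea can be illustrated in a few lines of code. The sketch below is not the authors' released implementation: it assumes an off-the-shelf NLI checkpoint (roberta-large-mnli via Hugging Face transformers), and the helper names (entailment_prob, nli_score, combined_score) and the mixing weight w are illustrative choices, not values from the paper.

```python
# Minimal sketch of an NLI-based evaluation metric in the spirit of MENLI.
# Assumptions (not from the paper): model choice, direction averaging,
# and the 50/50 combination weight below.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"  # assumed off-the-shelf NLI checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def entailment_prob(premise: str, hypothesis: str) -> float:
    """P(entailment) for one premise-hypothesis pair under the NLI model."""
    enc = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    # roberta-large-mnli label order: 0=contradiction, 1=neutral, 2=entailment
    return probs[2].item()

def nli_score(reference: str, candidate: str) -> float:
    """Direction-averaged NLI score between reference and candidate."""
    return 0.5 * (entailment_prob(reference, candidate)
                  + entailment_prob(candidate, reference))

def combined_score(reference: str, candidate: str, sim: float, w: float = 0.5) -> float:
    """Mix a similarity-metric score `sim` (e.g., from BERTScore) with the
    NLI score; w = 0.5 is an illustrative weight, not the paper's tuned value."""
    return w * nli_score(reference, candidate) + (1.0 - w) * sim
```

In this sketch, a candidate that contradicts the reference (e.g., a negated claim) receives a low entailment probability even when its surface similarity to the reference is high, which is the robustness argument the abstract makes; the combined score then follows the paper's observation that mixing NLI with an existing metric improves both robustness and benchmark quality.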

Related benchmarks

Task | Dataset | Metric | Result (%) | Rank
Factuality Evaluation | AggreFact-XSum (FTS) | Balanced Accuracy | 58.3 | 15
Factuality Evaluation | AggreFact-CNN (FTS) | Balanced Accuracy | 51.7 | 15
Factuality Evaluation | AggreFact-CNN (OLD) | Balanced Accuracy | 68.4 | 15
Factuality Evaluation | AggreFact-CNN (EXF) | Balanced Accuracy | 52.8 | 15
Factuality Evaluation | AggreFact-XSum (OLD) | Balanced Accuracy | 73.9 | 14
Factuality Evaluation | AggreFact (FTSOTA) | Balanced Accuracy (CNN-FTS) | 63.4 | 14
Factuality Evaluation | AggreFact-XSum (EXF) | Balanced Accuracy | 59.7 | 14
Factuality Evaluation | Long-form summarization factuality dataset (test) | Balanced Accuracy | 61.7 | 5
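All rows above report balanced accuracy. As a reference for reading the numbers, here is a minimal, self-contained sketch of that score (the function name and 0-100 scaling are illustrative): it averages per-class recall, so a classifier that always predicts the majority class scores 50 no matter how skewed the factuality dataset is.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall, scaled to 0-100 to match the table above."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return 100.0 * sum(recalls) / len(recalls)

# Example: 8 consistent vs 2 inconsistent summaries; always predicting
# "consistent" gives 100% recall on one class and 0% on the other -> 50.0.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1] * 10
print(balanced_accuracy(y_true, y_pred))  # 50.0
```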
