
MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance

About

A robust evaluation metric has a profound impact on the development of text generation systems. A desirable metric compares system output against references based on their semantics rather than surface forms. In this paper, we investigate strategies to encode system and reference texts to devise a metric that shows a high correlation with human judgment of text quality. We validate our new metric, namely MoverScore, on a number of text generation tasks including summarization, machine translation, image captioning, and data-to-text generation, where the outputs are produced by a variety of neural and non-neural systems. Our findings suggest that metrics combining contextualized representations with a distance measure perform the best. Such metrics also demonstrate strong generalization capability across tasks. For ease of use, we make our metrics available as a web service.
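The core idea — scoring a system text against a reference by the minimal cost of "moving" one set of token embeddings onto the other — can be illustrated with a minimal sketch. This is not the paper's implementation: MoverScore uses BERT contextualized embeddings, IDF weighting, and an exact Earth Mover Distance solver, whereas the toy version below assumes uniform token weights and equal-length inputs, which reduces the transport problem to a minimum-cost assignment solved by brute force. The function names (`emd_uniform`, `moverscore_like`) and the 2-D toy vectors are illustrative, not from the paper.

```python
import itertools
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def emd_uniform(xs, ys):
    """Earth Mover Distance between two equal-size sets of vectors
    with uniform weights. Under these assumptions the optimal
    transport plan is a one-to-one matching, found here by brute
    force over permutations (fine only for toy-sized inputs)."""
    n = len(xs)
    assert len(ys) == n, "uniform-weight sketch requires equal sizes"
    best = min(
        sum(euclidean(xs[i], ys[p[i]]) for i in range(n))
        for p in itertools.permutations(range(n))
    )
    return best / n  # average moving cost per unit of mass


def moverscore_like(sys_emb, ref_emb):
    """Toy MoverScore-style similarity: map the distance into (0, 1],
    so identical embedding sets score 1.0 and the score decays as
    the transport cost grows."""
    return 1.0 / (1.0 + emd_uniform(sys_emb, ref_emb))
```

For example, two sets containing the same vectors in different order have zero transport cost (score 1.0), while shifting every vector of the reference set by one unit yields an average cost of 1.0 (score 0.5). In a real setting the vectors would come from a contextualized encoder, one per token of the system and reference texts.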

Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, Steffen Eger • 2019

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Metrics correlation with human judgment | WebNLG Challenge 2017 | Spearman Correlation (rho) | 0.9 | 45 |
| Summarization Evaluation | SummEval | Avg Spearman Rho | 0.191 | 40 |
| Summarization | Newsroom (test) | Pearson Correlation | 0.337 | 36 |
| Question Answering | MOCHA (test) | Pearson's r | 0.592 | 36 |
| Image Captioning Hallucination Detection | FOIL (test) | Accuracy | 88.4 | 28 |
| Correlation with human judgment | Flickr8K-CF | Tau B | 22.8 | 26 |
| Dialogue Evaluation Human Correlation | Topical-Chat | Naturalness Pearson (r) | 0.169 | 26 |
| Story Generation | ROC Stories (test) | Pearson's r | 0.391 | 24 |
| Data-to-text evaluation | SFHOT | Spearman Correlation | 0.172 | 24 |
| Data-to-text evaluation | SFRES | Spearman Correlation | 0.153 | 24 |

Showing 10 of 31 rows.
