MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance

About

A robust evaluation metric has a profound impact on the development of text generation systems. A desirable metric compares system output against references based on their semantics rather than surface forms. In this paper we investigate strategies to encode system and reference texts to devise a metric that shows a high correlation with human judgment of text quality. We validate our new metric, namely MoverScore, on a number of text generation tasks including summarization, machine translation, image captioning, and data-to-text generation, where the outputs are produced by a variety of neural and non-neural systems. Our findings suggest that metrics combining contextualized representations with a distance measure perform the best. Such metrics also demonstrate strong generalization capability across tasks. For ease of use, we make our metrics available as a web service.
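
As a rough illustration of the idea in the abstract (embeddings combined with a distance measure), the sketch below computes an Earth Mover Distance between two small sets of vectors. It is not the authors' implementation: the function name `toy_mover_distance` and the 2-D vectors are hypothetical stand-ins for contextualized embeddings, and it assumes uniform token weights and equally many tokens on both sides. Under those assumptions an optimal transport plan is a one-to-one matching, so brute force over permutations is exact for a handful of tokens.

```python
from itertools import permutations
from math import dist


def toy_mover_distance(sys_vecs, ref_vecs):
    """Earth mover distance between two equal-sized vector sets, uniform weights.

    Assumption: every token carries mass 1/n, so the optimal transport plan
    is a one-to-one matching and can be found by trying all permutations.
    """
    n = len(sys_vecs)
    if n == 0 or n != len(ref_vecs):
        raise ValueError("toy version needs two non-empty sets of equal size")
    # Total Euclidean cost of the cheapest one-to-one matching.
    best_total = min(
        sum(dist(sys_vecs[i], ref_vecs[perm[i]]) for i in range(n))
        for perm in permutations(range(n))
    )
    return best_total / n  # scale by the per-token mass 1/n


# Hypothetical 2-D "embeddings" (real contextualized embeddings are much larger).
system_tokens = [(0.0, 0.0), (1.0, 0.0)]
reordered_reference = [(1.0, 0.0), (0.0, 0.0)]
distant_reference = [(0.0, 3.0), (1.0, 4.0)]

print(toy_mover_distance(system_tokens, reordered_reference))  # 0.0
print(toy_mover_distance(system_tokens, distant_reference))    # 3.5
```

The first call returns zero because the same vectors appear in a different order, which shows why a transport-based distance is insensitive to token order. Unequal lengths or non-uniform weights would need a general optimal transport solver instead of this brute-force matching.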

Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, Steffen Eger • 2019

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Correlation with human judgment | Flickr8K-CF | Tau B | 22.8 | 48 |
| Metrics correlation with human judgment | WebNLG challenge 2017 | Spearman Correlation (rho) | 0.9 | 45 |
| Summarization Evaluation | SummEval | Avg Spearman Rho | 0.191 | 45 |
| Data-to-text evaluation | SFRES | Spearman Correlation | 0.153 | 39 |
| Summarization | Newsroom (test) | Pearson Correlation | 0.337 | 36 |
| Question Answering | MOCHA (test) | Pearson's r | 0.592 | 36 |
| Image Captioning Hallucination Detection | FOIL (test) | Accuracy | 88.4 | 28 |
| Dialogue Evaluation Human Correlation | Topical-Chat | Naturalness Pearson (r) | 0.169 | 26 |
| Machine Translation Evaluation | WMT 2019 (test) | de-en | 0.25 | 25 |
| Data-to-text evaluation | SFHOT | Spearman Correlation (Naturalness) | 0.242 | 25 |
Showing 10 of 39 rows
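
Most results above are correlations between metric scores and human judgments. The sketch below shows, under stated assumptions, how Pearson and Spearman correlations of that kind can be computed with the Python standard library. The function names and the five data points are hypothetical and are not taken from any of the listed benchmarks.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical metric scores and human ratings for five system outputs.
metric_scores = [0.61, 0.35, 0.80, 0.52, 0.47]
human_ratings = [4, 2, 5, 3, 3]

print(pearson(metric_scores, human_ratings))
print(spearman(metric_scores, human_ratings))
```

Spearman's rho depends only on rank order, so it rewards a metric that orders outputs the way human raters do even when the relationship is not linear. Pearson's r measures linear agreement between the raw values.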
