MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance
About
A robust evaluation metric has a profound impact on the development of text generation systems. A desirable metric compares system output against references based on their semantics rather than surface forms. In this paper, we investigate strategies to encode system and reference texts to devise a metric that shows a high correlation with human judgment of text quality. We validate our new metric, namely MoverScore, on a number of text generation tasks including summarization, machine translation, image captioning, and data-to-text generation, where the outputs are produced by a variety of neural and non-neural systems. Our findings suggest that metrics combining contextualized representations with a distance measure perform the best. Such metrics also demonstrate strong generalization capability across tasks. For ease of use, we make our metrics available as a web service.
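The core idea of combining contextualized representations with a distance measure can be illustrated with a small sketch. The snippet below is not the official MoverScore implementation (which uses BERT embeddings, IDF weighting, and a Word Mover's Distance solver); it only shows, under toy assumptions, how an earth-mover-style cost between two equally sized sets of token embeddings with uniform weights reduces to an optimal one-to-one assignment. The function name `mover_distance` and the hand-made vectors are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mover_distance(sys_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """Earth-mover-style distance between two equally sized sets of
    (stand-in) contextualized token embeddings, with uniform weights.

    With uniform weights and equal set sizes, the optimal transport plan
    is a one-to-one matching, so it can be solved as an assignment problem.
    """
    # Pairwise Euclidean cost between system and reference token embeddings.
    cost = np.linalg.norm(sys_emb[:, None, :] - ref_emb[None, :, :], axis=-1)
    # Minimum-cost perfect matching (Hungarian algorithm).
    rows, cols = linear_sum_assignment(cost)
    return float(cost[rows, cols].mean())

# Toy embeddings: the same two vectors in a different order should
# yield a distance of zero, since matching ignores token order.
a = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.array([[0.0, 1.0], [1.0, 0.0]])
print(mover_distance(a, b))  # → 0.0
```

In the real metric, the embeddings come from a contextualized encoder, weights come from IDF, and a proper transport solver handles unequal set sizes and non-uniform weights; this sketch only captures the distance-over-embeddings intuition.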
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Metrics correlation with human judgment | WebNLG challenge 2017 | Spearman Correlation (rho) | 0.9 | 45 |
| Summarization Evaluation | SummEval | Avg Spearman Rho | 0.191 | 40 |
| Summarization | Newsroom (test) | Pearson Correlation | 0.337 | 36 |
| Question Answering | MOCHA (test) | Pearson's r | 0.592 | 36 |
| Image Captioning Hallucination Detection | FOIL (test) | Accuracy | 88.4 | 28 |
| Correlation with human judgment | Flickr8K-CF | Tau B | 22.8 | 26 |
| Dialogue Evaluation Human Correlation | Topical-Chat | Naturalness Pearson (r) | 0.169 | 26 |
| Story Generation | ROC stories (test) | Pearson's r | 0.391 | 24 |
| Data-to-text evaluation | SFHOT | Spearman Correlation | 0.172 | 24 |
| Data-to-text evaluation | SFRES | Spearman Correlation | 0.153 | 24 |