MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance
About
A robust evaluation metric has a profound impact on the development of text generation systems. A desirable metric compares system output against references based on their semantics rather than surface forms. In this paper, we investigate strategies to encode system and reference texts to devise a metric that shows a high correlation with human judgment of text quality. We validate our new metric, namely MoverScore, on a number of text generation tasks including summarization, machine translation, image captioning, and data-to-text generation, where the outputs are produced by a variety of neural and non-neural systems. Our findings suggest that metrics combining contextualized representations with a distance measure perform the best. Such metrics also demonstrate strong generalization capability across tasks. For ease of use, we make our metrics available as a web service.
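The core idea of combining contextualized representations with a distance measure can be illustrated with a small sketch. The snippet below is not the official MoverScore implementation (which uses BERT embeddings, IDF weighting, and a Word Mover's Distance solver); it only shows, under toy assumptions, how an earth-mover-style cost between two equally sized sets of token embeddings with uniform weights reduces to an optimal one-to-one assignment. The function name `mover_distance` and the hand-made vectors are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mover_distance(sys_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """Earth-mover-style distance between two equally sized sets of
    (stand-in) contextualized token embeddings, with uniform weights.

    With uniform weights and equal set sizes, the optimal transport plan
    is a one-to-one matching, so it can be solved as an assignment problem.
    """
    # Pairwise Euclidean cost between system and reference token embeddings.
    cost = np.linalg.norm(sys_emb[:, None, :] - ref_emb[None, :, :], axis=-1)
    # Minimum-cost perfect matching (Hungarian algorithm).
    rows, cols = linear_sum_assignment(cost)
    return float(cost[rows, cols].mean())

# Toy embeddings: the same two vectors in a different order should
# yield a distance of zero, since matching ignores token order.
a = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.array([[0.0, 1.0], [1.0, 0.0]])
print(mover_distance(a, b))  # → 0.0
```

In the real metric, the embeddings come from a contextualized encoder, weights come from IDF, and a proper transport solver handles unequal set sizes and non-uniform weights; this sketch only captures the distance-over-embeddings intuition.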
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Metrics correlation with human judgment | WebNLG challenge 2017 | Spearman Correlation (rho) | 0.9 | 45 |
| Summarization Evaluation | SummEval | Avg Spearman Rho | 0.191 | 40 |
| Summarization | Newsroom (test) | Pearson Correlation | 0.337 | 36 |
| Question Answering | MOCHA (test) | Pearson's r | 0.592 | 36 |
| Image Captioning Hallucination Detection | FOIL (test) | Accuracy | 88.4 | 28 |
| Correlation with human judgment | Flickr8K-CF | Tau B | 22.8 | 26 |
| Dialogue Evaluation Human Correlation | Topical-Chat | Naturalness Pearson (r) | 0.169 | 26 |
| Story Generation | ROC stories (test) | Pearson's r | 0.391 | 24 |
| Data-to-text evaluation | SFHOT | Spearman Correlation | 0.172 | 24 |
| Data-to-text evaluation | SFRES | Spearman Correlation | 0.153 | 24 |