USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation
About
The lack of meaningful automatic evaluation metrics for dialog has impeded open-domain dialog research. Standard language generation metrics have been shown to be ineffective for evaluating dialog models. To address this, the paper presents USR, an UnSupervised and Reference-free evaluation metric for dialog. USR trains unsupervised models to measure several desirable qualities of dialog, so it needs no reference responses. It is shown to correlate strongly with human judgment on both Topical-Chat (turn-level: 0.42, system-level: 1.0) and PersonaChat (turn-level: 0.48, system-level: 1.0). USR additionally produces interpretable scores for each of these qualities.
Shikib Mehri, Maxine Eskenazi • 2020
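The abstract reports correlations at two granularities, turn-level and system-level. The sketch below shows one common way such numbers are computed; it is not taken from the paper's code. It assumes a rank correlation (Spearman), that turn-level means correlating scores over individual responses, and that system-level means correlating per-system averages. The record fields `system`, `metric`, and `human`, and all the toy values, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def turn_level(records):
    """Correlate metric and human scores over individual responses."""
    return spearman([r["metric"] for r in records],
                    [r["human"] for r in records])


def system_level(records):
    """Correlate per-system mean metric scores with per-system mean human scores."""
    by_system = defaultdict(list)
    for r in records:
        by_system[r["system"]].append(r)
    metric_means = [sum(r["metric"] for r in rs) / len(rs) for rs in by_system.values()]
    human_means = [sum(r["human"] for r in rs) / len(rs) for rs in by_system.values()]
    return spearman(metric_means, human_means)


# Hypothetical toy data: three systems, three rated responses each.
records = [
    {"system": "A", "metric": 0.20, "human": 2},
    {"system": "A", "metric": 0.40, "human": 1},
    {"system": "A", "metric": 0.30, "human": 2},
    {"system": "B", "metric": 0.50, "human": 3},
    {"system": "B", "metric": 0.70, "human": 2},
    {"system": "B", "metric": 0.60, "human": 4},
    {"system": "C", "metric": 0.90, "human": 5},
    {"system": "C", "metric": 0.80, "human": 4},
    {"system": "C", "metric": 0.85, "human": 3},
]

turn_rho = turn_level(records)
system_rho = system_level(records)
print(f"turn-level rho:   {turn_rho:.3f}")
print(f"system-level rho: {system_rho:.3f}")
```

On this toy data the metric orders the three systems exactly as the human ratings do, so the system-level correlation is 1.0 even though the metric disagrees with humans on several individual turns. This is why a system-level value of 1.0 can sit alongside a much lower turn-level value: averaging over a handful of systems leaves very few points to rank.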
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Dialog Evaluation | Topical-Chat | Spearman Correlation | 0.4877 | 35 |
| Turn-level correlation with human Overall Quality ratings | PersonaChat turn-level | Spearman Correlation | 0.4814 | 20 |
| Dialogue Evaluation | EmpatheticDialogues | Spearman Correlation | 0.255 | 19 |
| Dialogue Evaluation | Topical-Chat turn-level | Naturalness (Pearson r) | 0.337 | 11 |
| Dialogue Evaluation | Topical-Eval | Spearman Correlation | 0.423 | 10 |
| Dialogue Evaluation | Persona-Eval | Spearman Correlation | 0.571 | 10 |
| Dialogue Evaluation | DailyDialog (eval) | Spearman Correlation | 0.367 | 10 |
| Dialogue Evaluation | Movie Eval | Spearman Correlation | 0.366 | 10 |
| Dialogue Evaluation | Twitter-Eval | Spearman Correlation | 0.166 | 10 |
| Dialogue Evaluation | USR-PersonaChat (test) | Pearson Correlation (r) | 0.495 | 7 |