
USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation

About

The lack of meaningful automatic evaluation metrics for dialog has impeded open-domain dialog research: standard language generation metrics have been shown to be ineffective for evaluating dialog models. To this end, this paper presents USR, an UnSupervised and Reference-free evaluation metric for dialog. USR trains unsupervised models to measure several desirable qualities of dialog, requiring no reference responses. USR is shown to correlate strongly with human judgment on both Topical-Chat (turn-level: 0.42, system-level: 1.0) and PersonaChat (turn-level: 0.48, system-level: 1.0), and it additionally produces interpretable measures for several desirable properties of dialog.
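To illustrate how turn-level and system-level correlations like those above are typically computed, here is a minimal sketch in pure Python. The scores, ratings, and system names are illustrative placeholders, not the paper's data or code: turn-level correlation compares per-turn metric scores against per-turn human ratings, while system-level correlation first averages each system's scores and then correlates the means.

```python
# Sketch (not the authors' code) of turn-level vs. system-level
# Spearman correlation between an automatic metric and human ratings.
# All scores and ratings below are illustrative placeholders.

def rank(values):
    """Assign 1-based ranks, averaging ranks over ties (as Spearman requires)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    # Spearman's rho is the Pearson correlation of the ranks.
    return pearson(rank(x), rank(y))

# Turn-level: correlate per-turn metric scores with human quality ratings.
metric_scores = [0.91, 0.42, 0.77, 0.15, 0.66, 0.88]
human_ratings = [5, 2, 4, 1, 3, 5]
turn_rho = spearman(metric_scores, human_ratings)

# System-level: average per dialog system, then correlate the system means.
metric_by_system = {"sys_a": [0.91, 0.88], "sys_b": [0.42, 0.15], "sys_c": [0.77, 0.66]}
human_by_system = {"sys_a": [5, 5], "sys_b": [2, 1], "sys_c": [4, 3]}
sys_metric = [sum(v) / len(v) for v in metric_by_system.values()]
sys_human = [sum(v) / len(v) for v in human_by_system.values()]
system_rho = spearman(sys_metric, sys_human)

print(turn_rho, system_rho)
```

With only a handful of systems, a metric that ranks them in the same order as humans reaches a system-level correlation of 1.0 even when its turn-level correlation is well below 1.0, which is why the paper reports both.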

Shikib Mehri, Maxine Eskenazi • 2020

Related benchmarks

Task                                                      | Dataset                 | Metric                  | Result | Rank
Dialog Evaluation                                         | Topical-Chat            | Spearman Correlation    | 0.4877 | 35
Turn-level correlation with human Overall Quality ratings | PersonaChat turn-level  | Spearman Correlation    | 0.4814 | 20
Dialogue Evaluation                                       | EmpatheticDialogues     | Spearman Correlation    | 0.255  | 19
Dialogue Evaluation                                       | Topical-Chat turn-level | Naturalness (Pearson r) | 0.337  | 11
Dialogue Evaluation                                       | Topical-Eval            | Spearman Correlation    | 0.423  | 10
Dialogue Evaluation                                       | Persona-Eval            | Spearman Correlation    | 0.571  | 10
Dialogue Evaluation                                       | DailyDialog (eval)      | Spearman Correlation    | 0.367  | 10
Dialogue Evaluation                                       | Movie Eval              | Spearman Correlation    | 0.366  | 10
Dialogue Evaluation                                       | Twitter-Eval            | Spearman Correlation    | 0.166  | 10
Dialogue Evaluation                                       | USR-PersonaChat (test)  | Pearson Correlation (r) | 0.495  | 7

(10 of 15 rows shown)
