
USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation

About

The lack of meaningful automatic evaluation metrics for dialog has impeded open-domain dialog research. Standard language generation metrics have been shown to be ineffective for evaluating dialog models. To this end, this paper presents USR, an UnSupervised and Reference-free evaluation metric for dialog. USR trains unsupervised models to measure several desirable qualities of dialog, without requiring reference responses. USR is shown to strongly correlate with human judgment on both Topical-Chat (turn-level: 0.42, system-level: 1.0) and PersonaChat (turn-level: 0.48, system-level: 1.0). USR additionally produces interpretable measures for several desirable properties of dialog.
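The correlations above compare the metric's scores against human ratings by rank agreement. As a minimal sketch (not the paper's code, and with made-up scores), here is how a Spearman correlation between metric scores and human quality ratings can be computed in pure Python; a value of 1.0, as in the system-level results, means the metric orders the items exactly as the humans do:

```python
def average_ranks(xs):
    # Assign 1-based ranks, giving tied values the average of their positions.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    # Standard Pearson correlation coefficient.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

def spearman(a, b):
    # Spearman correlation = Pearson correlation of the ranks.
    return pearson(average_ranks(a), average_ranks(b))

# Illustrative (invented) metric scores and 1-5 human ratings for five responses.
metric_scores = [0.91, 0.35, 0.72, 0.10, 0.55]
human_ratings = [5, 2, 4, 1, 3]
print(round(spearman(metric_scores, human_ratings), 3))
```

In this toy example the metric ranks the five responses identically to the human raters, so the Spearman correlation is 1.0; turn-level values like 0.42 indicate partial rank agreement.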

Shikib Mehri, Maxine Eskenazi • 2020

Related benchmarks

Task | Dataset | Result | Rank
Dialog Evaluation | Topical-Chat | Spearman Correlation: 0.4877 | 35
Turn-level correlation with human Overall Quality ratings | PersonaChat turn-level | Spearman Correlation: 0.4814 | 20
Dialogue Evaluation | EmpatheticDialogues | Spearman Correlation: 0.255 | 19
Chit-chat conversation evaluation correlation | USR-Persona | Pearson Correlation (r): 0.607 | 11
Dialogue Evaluation | Topical-Chat turn-level | Naturalness (Pearson r): 0.337 | 11
Chit-chat conversation evaluation correlation | USR-Topical | Pearson Correlation: 0.416 | 11
Dialogue Evaluation | Topical-Eval | Spearman Correlation: 0.423 | 10
Dialogue Evaluation | Persona-Eval | Spearman Correlation: 0.571 | 10
Dialogue Evaluation | DailyDialog (eval) | Spearman Correlation: 0.367 | 10
Dialogue Evaluation | Movie Eval | Spearman Correlation: 0.366 | 10

Showing 10 of 17 rows.
