USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation

About

The lack of meaningful automatic evaluation metrics for dialog has impeded open-domain dialog research. Standard language generation metrics have been shown to be ineffective for evaluating dialog models. To this end, this paper presents USR, an UnSupervised and Reference-free evaluation metric for dialog that trains unsupervised models to measure several desirable qualities of dialog. USR is shown to strongly correlate with human judgment on both Topical-Chat (turn-level: 0.42, system-level: 1.0) and PersonaChat (turn-level: 0.48, system-level: 1.0). USR additionally produces interpretable measures for several desirable properties of dialog.

Shikib Mehri, Maxine Eskenazi • 2020

Related benchmarks

Task	Dataset	Result
Dialog Evaluation	Topical-Chat	Spearman Correlation: 0.4877	35
Turn-level correlation with human Overall Quality ratings	PersonaChat turn-level	Spearman Correlation: 0.4814	20
Dialogue Evaluation	EmpatheticDialogues	Spearman Correlation: 0.255	19
Chit-chat conversation evaluation correlation	USR-Persona	Pearson Correlation (r): 0.607	11
Dialogue Evaluation	Topical-Chat turn-level	Naturalness (Pearson r): 0.337	11
Chit-chat conversation evaluation correlation	USR-Topical	Pearson Correlation: 0.416	11
Dialogue Evaluation	Topical-Eval	Spearman Correlation: 0.423	10
Dialogue Evaluation	Persona-Eval	Spearman Correlation: 0.571	10
Dialogue Evaluation	DailyDialog (eval)	Spearman Correlation: 0.367	10
Dialogue Evaluation	Movie Eval	Spearman Correlation: 0.366	10

Showing 10 of 17 rows
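
The results above are correlations between metric scores and human ratings. As a rough illustration of how turn-level and system-level Spearman correlations differ, here is a minimal sketch in plain Python. It is not the authors' evaluation code: the `records` data, the system names, and the helper functions are hypothetical and exist only for this example.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


def system_level(records):
    """Average the metric and human scores of each system."""
    grouped = defaultdict(list)
    for system, metric, human in records:
        grouped[system].append((metric, human))
    systems = sorted(grouped)
    metric_means = [sum(m for m, _ in grouped[s]) / len(grouped[s]) for s in systems]
    human_means = [sum(h for _, h in grouped[s]) / len(grouped[s]) for s in systems]
    return metric_means, human_means


# Hypothetical (system, metric score, human rating) triples, one per response.
records = [
    ("A", 0.2, 1), ("A", 0.5, 3), ("A", 0.4, 2),
    ("B", 0.6, 3), ("B", 0.3, 4), ("B", 0.7, 4),
    ("C", 0.9, 5), ("C", 0.8, 4), ("C", 0.6, 5),
]

# Turn level: correlate scores over individual responses.
turn_rho = spearman([m for _, m, _ in records], [h for _, _, h in records])

# System level: correlate per-system averages.
sys_rho = spearman(*system_level(records))

print(f"turn-level Spearman:   {turn_rho:.3f}")
print(f"system-level Spearman: {sys_rho:.3f}")
```

In this toy data the per-system averages are ordered identically by the metric and by the human ratings, so the system-level correlation is perfect even though the turn-level correlation is not. With only a handful of systems, a system-level value of 1.0 means the ranking of systems matches, not that every individual response is scored correctly.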
