# Unsupervised Evaluation of Interactive Dialog with DialoGPT

## About
Meaningful and interpretable automatic evaluation metrics are important for open-domain dialog research, and standard language generation metrics have been shown to be ineffective for dialog. This paper introduces the FED metric (fine-grained evaluation of dialog), an automatic evaluation metric that uses DialoGPT without any fine-tuning or supervision. It also introduces the FED dataset, constructed by annotating a set of human-system and human-human conversations with eighteen fine-grained dialog qualities. The FED metric (1) does not rely on a ground-truth response, (2) does not require training data, and (3) measures fine-grained dialog qualities at both the turn and whole-dialog levels, attaining moderate to strong correlation with human judgment at both levels.
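FED scores a dialog quality by measuring the likelihood DialoGPT assigns to hand-written positive and negative follow-up utterances given the conversation so far (e.g., "Wow, that's interesting!" for interestingness). The sketch below illustrates that idea with the Hugging Face `transformers` library and the `microsoft/DialoGPT-large` checkpoint; the follow-up utterances and the simple positive-minus-negative scoring are illustrative assumptions, not the paper's released follow-up set or exact scoring code.

```python
# Minimal sketch of the FED idea: score a dialog quality by how likely
# DialoGPT finds positive vs. negative follow-up utterances.
# The follow-ups below are illustrative, not the paper's curated set.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-large")
model.eval()


def follow_up_log_likelihood(context: str, follow_up: str) -> float:
    """Total log-likelihood DialoGPT assigns to `follow_up` after `context`."""
    ctx_ids = tokenizer.encode(context + tokenizer.eos_token, return_tensors="pt")
    full_ids = tokenizer.encode(
        context + tokenizer.eos_token + follow_up + tokenizer.eos_token,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position i of the logits predicts token i + 1, so shift by one and
    # sum log-probabilities over the follow-up tokens only.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    start = ctx_ids.shape[1] - 1  # index of the first follow-up token in `targets`
    return sum(log_probs[i, targets[i]].item() for i in range(start, targets.shape[0]))


def quality_score(context: str, positive: list, negative: list) -> float:
    """Higher when DialoGPT prefers the positive follow-ups over the negative ones."""
    pos = sum(follow_up_log_likelihood(context, u) for u in positive)
    neg = sum(follow_up_log_likelihood(context, u) for u in negative)
    return pos - neg


# Turns are joined with DialoGPT's end-of-text token, as in its training data.
context = tokenizer.eos_token.join(
    ["Hi!", "Hello! I just got back from a trip to Iceland."]
)
score = quality_score(
    context,
    positive=["Wow, that is really interesting!"],
    negative=["That is really boring."],
)
print(f"interestingness score: {score:.2f}")
```

In the paper, each of the eighteen qualities has its own set of follow-up utterances, and the resulting scores are correlated against human judgments at the turn and whole-dialog levels.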
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Dialogue commonsense evaluation | DECO (test) | Pearson Correlation | -0.12 | 6 |
| Dialogue commonsense evaluation | ConTurE | Pearson Correlation | -0.08 | 6 |
| Dialogue Evaluation | DSTC9 Interactive Conversation (test) | Pearson Correlation | 0.67 | 3 |