
GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems

About

Automatically evaluating dialogue coherence is challenging but in high demand for developing high-quality open-domain dialogue systems. However, current evaluation metrics consider only surface features or utterance-level semantics, without explicitly modeling the fine-grained topic-transition dynamics of dialogue flows. We observe that a graph structure built from the topics in a dialogue can accurately depict the underlying communication logic, which is a more natural basis for a persuasive metric. Capitalizing on the topic-level dialogue graph, we propose a new evaluation metric, GRADE, which stands for Graph-enhanced Representations for Automatic Dialogue Evaluation. Specifically, GRADE incorporates both coarse-grained utterance-level contextualized representations and fine-grained topic-level graph representations to evaluate dialogue coherence. The graph representations are obtained by reasoning over topic-level dialogue graphs enhanced with evidence from a commonsense graph, including k-hop neighboring representations and hop-attention weights. Experimental results show that GRADE significantly outperforms other state-of-the-art metrics on measuring diverse dialogue models, in terms of Pearson and Spearman correlations with human judgements. In addition, we release a new large-scale human evaluation benchmark to facilitate future research on automatic metrics.
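The abstract describes GRADE as fusing a coarse utterance-level context vector with a fine topic-level graph vector before mapping to a scalar coherence score. The sketch below illustrates only that fusion-and-score step; all dimensions, weights, and the `grade_score` name are hypothetical (in the paper, the context vector comes from a BERT encoder and the graph vector from attention over a ConceptNet-enhanced topic graph, and the scoring head is trained rather than random).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 768 for a BERT-style context encoding,
# 300 for the topic-graph representation, 128 hidden units.
CTX_DIM, GRAPH_DIM, HIDDEN = 768, 300, 128

# Stand-in for a trained scoring head (weights here are random, not learned).
W1 = rng.standard_normal((CTX_DIM + GRAPH_DIM, HIDDEN)) * 0.01
W2 = rng.standard_normal((HIDDEN, 1)) * 0.01

def grade_score(ctx_vec, graph_vec):
    """Concatenate the coarse utterance-level and fine topic-level
    features, pass them through a small MLP, and squash to (0, 1)
    with a sigmoid to produce a coherence score."""
    fused = np.concatenate([ctx_vec, graph_vec])
    h = np.tanh(fused @ W1)
    return (1.0 / (1.0 + np.exp(-(h @ W2)))).item()

# Toy inputs standing in for encoded (context, response) features.
score = grade_score(rng.standard_normal(CTX_DIM),
                    rng.standard_normal(GRAPH_DIM))
assert 0.0 < score < 1.0
```

The key design point this mirrors is that coherence is judged from both granularities jointly: neither the utterance-level nor the topic-level signal alone determines the score.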

Lishan Huang, Zheng Ye, Jinghui Qin, Liang Lin, Xiaodan Liang • 2020

Related benchmarks

Task                 Dataset                  Metric                   Score   Rank
Dialogue Evaluation  EmpatheticDialogues      Spearman Correlation     0.344   19
Dialogue Evaluation  Movie Eval               Spearman Correlation     0.612   10
Dialogue Evaluation  DailyDialog (eval)       Spearman Correlation     0.533   10
Dialogue Evaluation  Persona-Eval             Spearman Correlation     0.583   10
Dialogue Evaluation  Topical-Eval             Spearman Correlation     0.217   10
Dialogue Evaluation  Twitter-Eval             Spearman Correlation     0.122   10
Dialogue Evaluation  ConvAI2                  Pearson Correlation      0.496   9
Dialogue Evaluation  USR-PersonaChat (test)   Pearson Correlation (r)  0.358   7
Dialogue Evaluation  USR-TopicalChat (test)   Pearson Correlation (r)  0.2     7
Dialogue Evaluation  Chatbot Domain           Correlation Score        0.87    6
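All of the benchmark numbers above are correlations between a metric's scores and human judgements. As a reference for how these are computed, here is a minimal pure-Python sketch of Pearson r and Spearman rho (Spearman is simply Pearson applied to average ranks); variable names and the toy data are illustrative only.

```python
from statistics import mean

def pearson(x, y):
    # Pearson r: covariance normalized by the product of standard deviations.
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(x):
    # 1-based average ranks; tied values share the mean of their positions.
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    # Spearman rho is Pearson r computed on the ranks of each variable.
    return pearson(ranks(x), ranks(y))

# Toy example: metric scores that rank responses the same way humans do
# get a perfect Spearman correlation even if the values differ.
metric_scores = [0.10, 0.40, 0.35, 0.80]
human_scores = [1, 3, 2, 4]
rho = spearman(metric_scores, human_scores)
```

In practice one would use `scipy.stats.pearsonr` and `scipy.stats.spearmanr`, which additionally report p-values.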
