
Better Automatic Evaluation of Open-Domain Dialogue Systems with Contextualized Embeddings

About

Despite advances in open-domain dialogue systems, automatic evaluation of such systems remains a challenging problem. Traditional reference-based metrics such as BLEU are ineffective because there can be many valid responses for a given context that share no common words with the reference responses. A recent work proposed the Referenced metric and Unreferenced metric Blended Evaluation Routine (RUBER), which combines a learning-based metric that predicts the relatedness between a generated response and a given query with a reference-based metric; it showed high correlation with human judgments. In this paper, we explore using contextualized word embeddings to compute more accurate relatedness scores, and thus better evaluation metrics. Experiments show that our evaluation metrics outperform RUBER, which is trained on static embeddings.
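The RUBER-style pipeline the abstract describes can be sketched as follows. This is a hedged, minimal illustration, not the authors' implementation: the toy random arrays stand in for contextualized (e.g. BERT) word embeddings, the referenced metric is approximated by cosine similarity of mean-pooled embeddings, and the unreferenced score (which in the paper comes from a trained query-response relatedness model) is a placeholder constant. The function names `pool`, `referenced_score`, and `blended_score` are hypothetical.

```python
# Sketch of a RUBER-style blended evaluation metric (illustrative only).
import numpy as np

def pool(word_vecs):
    """Mean-pool a (num_words, dim) matrix of word embeddings into one vector."""
    return np.mean(word_vecs, axis=0)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def referenced_score(gen_vecs, ref_vecs):
    """Referenced metric: embedding similarity between the generated
    response and the reference response."""
    return cosine(pool(gen_vecs), pool(ref_vecs))

def blended_score(ref_score, unref_score, strategy="min"):
    """Blend the referenced and unreferenced scores; RUBER considers
    simple strategies such as min, max, and mean."""
    if strategy == "min":
        return min(ref_score, unref_score)
    if strategy == "max":
        return max(ref_score, unref_score)
    return (ref_score + unref_score) / 2.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gen = rng.normal(size=(5, 8))   # 5 "words", 8-dim toy embeddings
    ref = rng.normal(size=(6, 8))
    r = referenced_score(gen, ref)
    # A trained relatedness model would supply the unreferenced score;
    # a constant stands in here.
    print(blended_score(r, 0.7, "mean"))
```

Replacing the static (e.g. word2vec) embeddings that RUBER pools with contextualized embeddings is the paper's central change: the same word gets different vectors in different contexts, which makes the relatedness scores more discriminative.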

Sarik Ghazarian, Johnny Tian-Zheng Wei, Aram Galstyan, Nanyun Peng · 2019

Related benchmarks

Task                 | Dataset             | Metric               | Result | Rank
---------------------|---------------------|----------------------|--------|-----
Dialogue Evaluation  | EmpatheticDialogues | Spearman Correlation | 0.148  | 19
Dialogue Evaluation  | Topical-Eval        | Spearman Correlation | 0.348  | 10
Dialogue Evaluation  | Twitter-Eval        | Spearman Correlation | 0.217  | 10
Dialogue Evaluation  | Movie Eval          | Spearman Correlation | 0.388  | 10
Dialogue Evaluation  | DailyDialog (eval)  | Spearman Correlation | 0.285  | 10
Dialogue Evaluation  | Persona-Eval        | Spearman Correlation | 0.384  | 10
Dialogue Evaluation  | ConvAI2             | Pearson Correlation  | 0.266  | 9
