DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue
About
A long-standing goal of task-oriented dialogue research is the ability to flexibly adapt dialogue models to new domains. To advance research in this direction, we introduce DialoGLUE (Dialogue Language Understanding Evaluation), a public benchmark consisting of 7 task-oriented dialogue datasets covering 4 distinct natural language understanding tasks, designed to encourage dialogue research in representation-based transfer, domain adaptation, and sample-efficient task learning. We release several strong baseline models that, through pre-training on a large open-domain dialogue corpus and task-adaptive self-supervised training, improve over a vanilla BERT architecture and achieve state-of-the-art results on 5 of the 7 tasks. Through the DialoGLUE benchmark, the baseline methods, and our evaluation scripts, we hope to facilitate progress towards the goal of developing more general task-oriented dialogue models.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Dialogue State Tracking | MultiWOZ 2.1 (test) | Joint Goal Accuracy | 58.7 | 85 |
| Intent Classification | Banking77 | Accuracy | 94.77 | 24 |
| Intent Detection | HWU 10-shot (test) | Accuracy | 86.28 | 16 |
| Intent Detection | CLINC 10-shot (test) | Accuracy | 93.97 | 16 |
| Intent Detection | BANKING 10-shot (test) | Accuracy | 85.95 | 16 |
| Intent Detection | HWU 5-shot (test) | Accuracy | 80.01 | 12 |
| Intent Detection | CLINC 5-shot (test) | Accuracy | 90.49 | 12 |
| Intent Detection | BANKING 5-shot (test) | Accuracy | 77.75 | 12 |
| Intent Detection | HWU Full (test) | Accuracy | 93.03 | 11 |
| Intent Detection | CLINC Full (test) | Accuracy | 97.31 | 11 |
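The table reports two metrics: plain accuracy for intent classification/detection, and joint goal accuracy for dialogue state tracking. As a minimal sketch of how these are typically computed (function names and state representation are illustrative; the official DialoGLUE evaluation scripts may differ in detail):

```python
def intent_accuracy(predicted_intents, gold_intents):
    """Fraction of utterances whose predicted intent label matches the gold label."""
    assert len(predicted_intents) == len(gold_intents)
    correct = sum(p == g for p, g in zip(predicted_intents, gold_intents))
    return correct / len(gold_intents)

def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of dialogue turns where the ENTIRE predicted state
    (every slot-value pair) exactly matches the gold state.
    A single wrong or missing slot makes the whole turn incorrect,
    which is why joint goal accuracy is much lower than per-slot accuracy."""
    assert len(predicted_states) == len(gold_states)
    correct = sum(p == g for p, g in zip(predicted_states, gold_states))
    return correct / len(gold_states)
```

For example, a turn predicted as `{"hotel-area": "north", "hotel-stars": "4"}` only counts as correct under joint goal accuracy if the gold state contains exactly those two slot-value pairs and no others.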