GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
About
For natural language understanding (NLU) technology to be maximally useful, both practically and as a scientific object of study, it must be general: it must be able to process language in a way that is not exclusively tailored to any one specific task or dataset. In pursuit of this objective, we introduce the General Language Understanding Evaluation benchmark (GLUE), a tool for evaluating and analyzing the performance of models across a diverse range of existing NLU tasks. GLUE is model-agnostic, but it incentivizes sharing knowledge across tasks because certain tasks have very limited training data. We further provide a hand-crafted diagnostic test suite that enables detailed linguistic analysis of NLU models. We evaluate baselines based on current methods for multi-task and transfer learning and find that they do not immediately give substantial improvements over the aggregate performance of training a separate model per task, indicating room for improvement in developing general and robust NLU systems.
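The "aggregate performance" GLUE reports is an unweighted macro-average of each task's primary metric, with multi-metric tasks (e.g. MRPC's accuracy and F1) averaged within the task first. A minimal sketch of that scoring rule — the task names are real GLUE tasks, but the scores below are illustrative, not published results:

```python
# Sketch of a GLUE-style aggregate score: an unweighted macro-average of
# per-task metrics. Scores below are illustrative placeholders.

def glue_aggregate(task_scores: dict) -> float:
    """Average per-task scores; tasks reporting multiple metrics
    (e.g. MRPC accuracy/F1) are first averaged within the task."""
    per_task = []
    for task, metrics in task_scores.items():
        if isinstance(metrics, dict):
            # average the task's own metrics before the macro-average
            per_task.append(sum(metrics.values()) / len(metrics))
        else:
            per_task.append(metrics)
    return sum(per_task) / len(per_task)

scores = {
    "CoLA": 36.0,                        # Matthews correlation
    "SST-2": 90.4,                       # accuracy
    "MRPC": {"acc": 84.0, "f1": 89.0},   # averaged within task -> 86.5
    "RTE": 83.51,                        # accuracy
}
print(round(glue_aggregate(scores), 2))  # macro-average of 36.0, 90.4, 86.5, 83.51
```

Because the average is unweighted, low-resource tasks such as CoLA and RTE pull on the headline number just as hard as the large ones, which is part of why GLUE incentivizes sharing knowledge across tasks.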
Related benchmark results
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Natural Language Understanding | GLUE (dev) | SST-2 (Acc) | 91.5 | 504 |
| Natural Language Understanding | GLUE | SST-2 | 93.2 | 452 |
| Natural Language Understanding | GLUE (test) | SST-2 Accuracy | 90.4 | 416 |
| Text Classification | RTE | Accuracy | 83.51 | 78 |
| Sentiment Classification | SST (test) | Accuracy | 91.6 | 37 |
| Natural Language Understanding | GLUE 1.0 (test) | CoLA (MCC) | 36 | 25 |
| Natural Language Understanding | GLUE SST-2, QQP, MNLI-m, MNLI-mm official (test) | SST-2 Accuracy | 90.4 | 9 |
| OCR | SST-2 (test) | Accuracy | 80 | 5 |
| Text Classification | MRPC GLUE | Accuracy | 92.08 | 2 |