| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Natural Language Understanding | SuperGLUE (dev) | Average Score | 93.2 | 91 |
| Natural Language Understanding | SuperGLUE | SGLUE Score | 91.3 | 84 |
| Natural Language Understanding | SuperGLUE (test) | BoolQ Accuracy | 92.4 | 74 |
| Natural Language Understanding | SuperGLUE | CB Accuracy | 94.5 | 32 |
| Natural Language Understanding | SuperGLUE | WSC Score | 76.9 | 25 |
| Natural Language Understanding | SuperGLUE | MultiRC Score | 75.9 | 22 |
| Natural Language Understanding | SuperGLUE | SST-2 Accuracy | 96 | 18 |
| Natural Language Understanding | SuperGLUE RoBERTa-large (test) | ReCoRD | 89.21 | 17 |
| Natural Language Understanding | SuperGLUE few-shot | BoolQ Accuracy | 0.818 | 16 |
| Natural Language Understanding | SuperGLUE 1,000 examples | BoolQ Accuracy | 84 | 15 |
| Multiple Choice | SuperGLUE | COPA Score | 83 | 14 |
| Classification | SuperGLUE | CB Accuracy | 96.4 | 14 |
| Natural Language Processing | SuperGLUE Full, excl. ReCoRD (dev) | Macro Avg Score | 70.03 | 13 |
| Natural Language Processing | SuperGLUE 1k samples, excl. ReCoRD (dev) | Macro Avg Score | 65.84 | 13 |
| Natural Language Processing | SuperGLUE 100 samples, excl. ReCoRD (dev) | Macro Avg Score | 59.88 | 13 |
| Natural Language Understanding | SuperGLUE (test val) | SST-2 Accuracy | 96 | 12 |
| Natural Language Understanding | SuperGLUE Zero-shot | BoolQ Accuracy | 88 | 11 |
| Natural Language Understanding | SuperGLUE 1,000 examples (test) | BoolQ | 86.7 | 10 |
| Text Classification | SuperGLUE (val) | Average Validation Score | 89.2 | 10 |
| NLU and Question Answering | SuperGLUE | SST-2 Accuracy | 94.7 | 9 |
| Failure Diagnosis | SuperGLUE | Macro Similarity Score | 36 | 8 |
| Natural Language Understanding | SuperGLUE v1 (test) | BoolQ Accuracy | 91.3 | 7 |
| Natural Language Understanding | SuperGLUE | BoolQ Accuracy | 88.5 | 6 |
| Natural Language Understanding | SuperGLUE | SST-2 Accuracy | 94.81 | 6 |
| Classification | SuperGLUE | RTE Score | 72.2 | 6 |