| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Natural Language Understanding | SuperGLUE (dev) | Average Score | 93.2 | 91 |
| Natural Language Understanding | SuperGLUE | SGLUE Score | 91.3 | 84 |
| Natural Language Understanding | SuperGLUE (test) | BoolQ Accuracy | 92.4 | 63 |
| Natural Language Understanding | SuperGLUE | SST-2 Accuracy | 96 | 18 |
| Natural Language Understanding | SuperGLUE RoBERTa-large (test) | ReCoRD | 89.21 | 17 |
| Natural Language Understanding | SuperGLUE few-shot | BoolQ Accuracy | 0.818 | 16 |
| Natural Language Understanding | SuperGLUE 1,000 examples | BoolQ Accuracy | 84 | 15 |
| Natural Language Understanding | SuperGLUE | WSC Score | 57.69 | 13 |
| Natural Language Processing | SuperGLUE Full, excl. ReCoRD (dev) | Macro Avg Score | 70.03 | 13 |
| Natural Language Processing | SuperGLUE 1k samples, excl. ReCoRD (dev) | Macro Avg Score | 65.84 | 13 |
| Natural Language Processing | SuperGLUE 100 samples, excl. ReCoRD (dev) | Macro Avg Score | 59.88 | 13 |
| Natural Language Understanding | SuperGLUE Zero-shot | BoolQ Accuracy | 88 | 11 |
| Natural Language Understanding | SuperGLUE 1,000 examples (test) | BoolQ | 86.7 | 10 |
| Text Classification | SuperGLUE (val) | Average Validation Score | 89.2 | 10 |
| Failure Diagnosis | SuperGLUE | Macro Similarity Score | 36 | 8 |
| Natural Language Understanding | SuperGLUE v1 (test) | BoolQ Acc | 91.3 | 7 |
| Automated Probing | SuperGLUE | Error Rate | 38 | 3 |