| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Natural Language Inference | MNLI (matched) | Accuracy 91.7 | 110 |
| Natural Language Inference | MNLI | Accuracy (matched) 90.8 | 80 |
| Natural Language Inference | MNLI (mismatched) | Accuracy 91 | 68 |
| Natural Language Inference | MNLI (dev) | Acc (m) 90.2 | 44 |
| Natural Language Inference | MNLI (test) | Accuracy 0.898 | 38 |
| Natural Language Inference | MNLI | Accuracy 86.2 | 36 |
| Text Classification | MNLI | Accuracy 87.45 | 32 |
| Classification | MNLI (val) | Accuracy 84.17 | 32 |
| Natural Language Inference | MNLI mm | Accuracy 90.7 | 30 |
| Natural Language Inference | MNLI (val) | Accuracy 92.13 | 26 |
| Natural Language Inference | MNLI | Accuracy 87.98 | 22 |
| Natural Language Inference | MNLI few-shot zero-shot | Accuracy 71.1 | 16 |
| Structural Bias Evaluation | MNLI | Accuracy 98.1 | 14 |
| Natural Language Inference | MNLI-m | Accuracy 77.2 | 13 |
| Natural Language Inference | MNLI Unknown Bias (in-distribution) | Accuracy 84.2 | 13 |
| Natural Language Inference | MNLI HardSP (challenge) | Accuracy 83.2 | 13 |
| Natural Language Inference | MNLI HardCD (challenge) | Accuracy 0.803 | 13 |
| Natural Language Inference | MNLI Hypothesis-only Bias (in-distribution) | Accuracy 84.2 | 13 |
| Natural Language Inference | MNLI Syntactic Bias (in-distribution) | Accuracy 84.3 | 13 |
| Natural Language Inference | MNLI (all combined) | Accuracy 85.98 | 12 |
| Natural Language Inference | MNLI-m (dev) | Accuracy 90.6 | 12 |
| Hallucination Detection | MNLI (test) | AuROC 100 | 10 |
| Natural Language Inference | MNLI | Accuracy 89.5 | 10 |
| Performance prediction | MNLI source domains (out-of-domain) | ROC AUC 0.683 | 10 |
| Performance prediction | MNLI source domains (in-domain) | ROC AUC 0.699 | 10 |