| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Natural Language Inference | MultiNLI (test) | Average Worst-Group Accuracy 88.05 | 81 |
| Natural Language Inference | MultiNLI matched (test) | Accuracy 85.38 | 65 |
| Natural Language Inference | MultiNLI Mismatched | Accuracy 79.1 | 60 |
| Natural Language Inference | MultiNLI mismatched (test) | Accuracy 81.4 | 56 |
| Natural Language Inference | MultiNLI Matched | Accuracy 80.2 | 49 |
| Natural Language Inference | MultiNLI mismatched (cross-domain) RepEval 2017 (test) | Accuracy 75.8 | 25 |
| Natural Language Inference | MultiNLI | Accuracy 82.4 | 23 |
| Natural Language Inference | MultiNLI matched (dev) | Accuracy 88.4 | 23 |
| Text Classification | MultiNLI (test) | WGA 81.3 | 18 |
| Natural Language Inference | MultiNLI matched (in-domain) RepEval 2017 (test) | Accuracy 76.8 | 18 |
| Confidence Calibration | MultiNLI Mismatch (test) | ECE 0.0071 | 16 |
| Natural Language Understanding | MultiNLI (Match) | ECE 1.02 | 16 |
| Natural Language Inference | MultiNLI mismatched (dev) | Accuracy 88.4 | 11 |
| Natural Language Inference | MultiNLI matched/mismatched | Accuracy 92.6 | 10 |
| Natural Language Inference | MultiNLI matched (in-domain) | Accuracy 74.6 | 8 |
| Natural Language Inference | MultiNLI matched (val) | Accuracy 91.7 | 8 |
| Text Classification | MultiNLI | Average Accuracy 81.1 | 7 |
| Natural Language Inference | MultiNLI WILDS (test) | IID Accuracy 82.1 | 6 |
| Natural Language Inference | MultiNLI reconstructed with controlled shortcut injection (test) | MSTPS 0.797 | 5 |
| Natural Language Inference | MultiNLI controlled shortcut injection | Accuracy 32.3 | 5 |
| Natural Language Inference | MultiNLI (val) | Accuracy 73.17 | 5 |