| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Natural Language Inference | SNLI (test) | Accuracy | 94.7 | 694 |
| Natural Language Inference | SNLI | Accuracy | 100 | 196 |
| Natural Language Inference | SNLI (train) | Accuracy | 99.7 | 154 |
| Natural Language Inference | SNLI (dev) | Accuracy | 93.6 | 71 |
| Counterfactual Generation | SNLI Hypothesis | LFR | 83 | 37 |
| Counterfactual Generation | SNLI Premise | LFR | 0.759 | 37 |
| Natural Language Inference | SNLI hard 1.0 (test) | Accuracy | 84.48 | 27 |
| Explanation Faithfulness | SNLI | Delta AF | 0.989 | 24 |
| Masked Language Modeling | SNLI (randomly sampled) | PPL (U) | 8.57 | 20 |
| Natural Language Inference | SNLI 1.0 (test) | Accuracy | 90.67 | 19 |
| Explanation Evaluation | SNLI (test) | Sufficiency | 43.76 | 16 |
| Natural Language Inference | SNLI-Neg | Accuracy | 75.9 | 14 |
| Membership Inference Attack | SNLI | ROC AUC | 99.8 | 12 |
| Natural Language Inference | SNLI source: MNLI (test) | Accuracy | 80.2 | 12 |
| Natural Language Inference | SNLI | Correlation Coefficient | 83.05 | 10 |
| Natural Language Inference | SNLI Combined variant (test) | Accuracy | 88.93 | 10 |
| Natural Language Inference | SNLI Noise variant (test) | Accuracy | 89.77 | 10 |
| Natural Language Inference | SNLI Emoji variant (test) | Accuracy | 88.96 | 10 |
| Natural Language Inference | SNLI Slang variant (test) | Accuracy | 92.8 | 10 |
| Natural Language Inference | SNLI Original (test) | Accuracy | 93.12 | 10 |
| Ranking correlation with full dataset evaluation | SNLI | Kendall Correlation | 0.93 | 10 |
| Human Alignment | SNLI | R@1 | 18.5 | 9 |
| Natural Language Inference | SNLI | Macro-F1 | 72.59 | 9 |
| Semantic Differentiation | SNLI | Wasserstein Distance | 3.72 | 9 |
| Comparative Reasoning | delta-SNLI | Accuracy | 88.9 | 9 |