| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Multi-fidelity Multi-armed Bandit | NLI shared evaluation pool | Mean Cost-Weighted Pseudo-Regret | 3,277.7 | 18 |
| Natural Language Inference | NLI adversarial benchmark (test) | Average Score | 75.4 | 18 |
| Natural Language Inference | NLI | Accuracy | 91.2 | 14 |
| Natural Language Inference | NLI ANLI and HANS (unseen) | ANLI Score | 32.4 | 9 |
| Prompt Injection Detection | NLI | Detection Rate (TPR/FPR) | 100 | 8 |
| Natural Language Inference | NLI domain average | Best Accuracy | 87.5 | 8 |
| Prompt Localization | NLI | RL Score | 97.9 | 3 |
| Natural Language Inference | NLI (test) | Relative CPU Speed | 2.89 | 2 |