| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Commonsense Reasoning | Commonsense Reasoning (BoolQ, PIQA, SIQA, HellaS., WinoG., ARC-e, ARC-c, OBQA) (test) | BoolQ Accuracy 88 | 238 |
| Commonsense Reasoning | Commonsense Reasoning (BoolQ, PIQA, SIQA, HellaS., WinoG., ARC-e, ARC-c, OBQA) | BoolQ Accuracy 89.69 | 223 |
| Commonsense Reasoning | Commonsense Reasoning | Accuracy 85 | 57 |
| Commonsense Reasoning | Commonsense Reasoning (BoolQ, PIQA, HellaSwag, Winogrande) zero-shot | Avg Commonsense Accuracy 84.9 | 34 |
| Commonsense Reasoning | Commonsense Reasoning (PIQA, WinoG., HellaS., BoolQ, SIQA, OBQA) (test) | PIQA Accuracy 89.9 | 32 |
| Visual Reasoning | Commonsense Reasoning | Jaccard Index (J) 8 | 30 |
| Commonsense Reasoning | Commonsense Reasoning | BoolQ Accuracy 76.5 | 29 |
| Commonsense Reasoning | Commonsense Reasoning | BoolQ Accuracy 75.1 | 27 |
| Commonsense Reasoning | Commonsense Reasoning Tasks (ARC-C, ARC-E, HellaSwag, LAMBADA, PIQA, WinoGrande) | ARC-C Accuracy 41.47 | 25 |
| Commonsense Reasoning | Commonsense Reasoning | WinoGrande Accuracy (WG) 80.66 | 24 |
| Commonsense Reasoning | Commonsense Reasoning (OBQA, ARC-C, Wino, PIQA, Social, ARC-E, BoolQ, Hella) | OBQA 94.8 | 24 |
| Zero-shot Commonsense Reasoning | Commonsense Reasoning PIQA HellaSwag WinoGrande ARC-Easy OpenBookQA MathQA (test) | Zero-shot Accuracy 59 | 21 |
| Commonsense Reasoning | Commonsense Reasoning (test) | BoolQ Accuracy 70.13 | 21 |
| Commonsense Reasoning | Commonsense Reasoning | ARC-E Accuracy 81.19 | 20 |
| Commonsense Reasoning | Commonsense Reasoning LLaMA2-7B | Average Accuracy 79.68 | 18 |
| Commonsense Reasoning | Commonsense Reasoning Task | HellaSwag Accuracy 53.63 | 12 |
| Commonsense Reasoning | Commonsense Reasoning 8 datasets | BoolQ Accuracy 73.6 | 11 |
| Agentic Routing | Commonsense Reasoning (CS) | Accuracy 82.7 | 10 |
| Commonsense Reasoning | Commonsense Reasoning (lm-evaluation-harness) zero-shot | LAMBADA Perplexity 11.86 | 10 |
| Commonsense Reasoning | Commonsense Reasoning (BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, OBQA) LLaMA2 7B backbone (test) | BoolQ Accuracy 88.5 | 10 |
| Commonsense Reasoning | Commonsense Reasoning MC+PEFT LLaMA3.2-1B (test) | BoolQ Accuracy 62.4 | 8 |
| Commonsense Reasoning | Commonsense Reasoning (OpenBookQA, ARC-E, ARC-C, WinoGrande, PIQA, MathQA, HellaSwag) | OpenBookQA 34 | 7 |
| Commonsense Reasoning | Commonsense Reasoning LLaMA-3.2-3B-Instruct (test) | ARC-c 76.1 | 6 |
| Commonsense Reasoning | Commonsense Reasoning (HellaSwag, OBQA, WinoGrande, ARC, PIQA) | HellaSwag 52.3 | 5 |
| Commonsense Reasoning | Commonsense Reasoning Tasks HellaSwag, PIQA, WinoGrande | HellaSwag Accuracy 33.9 | 4 |