| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Reasoning | BigBenchHard | Accuracy (BigBenchHard) | 100 | 22 |
| General Reasoning | BigBench-Lite Topic Domains | BBL Score | 66.8 | 18 |
| Commonsense Reasoning | BigBenchHard | Accuracy | 71.7 | 18 |
| Discriminative tasks | BigBench 13 tasks (val) | Accuracy | 58.7 | 17 |
| Question Answering | BIGBENCH II | True WS Score | 100 | 12 |
| Generative Classification | BigBench (test) | Accuracy | 76.6 | 10 |
| Natural Language Processing | BigBench II | Accuracy Degradation (%) | -0.37 | 9 |
| Audio-based Reasoning | BigBench Audio | Accuracy | 73.77 | 8 |
| Language Modeling and Reasoning | BigBench (Lamb, SQuAD, CoQA, BBH, LSAT, LangID) | Avg Score | 24 | 8 |
| Instruction Induction | BigBench Instruction Induction (BBII) (test) | BBII Text Classification Score | 60.14 | 6 |
| Linguistic Reasoning | BigBench Hard Hyperbaton | Accuracy | 80.2 | 5 |
| Linguistic Reasoning | BigBench Hard Snarks | Accuracy | 0.554 | 5 |
| Logical Reasoning | BigBench Hard Formal Fallacies | Accuracy | 58.2 | 5 |
| Multi-task Reasoning | BigBench Hard | Score | 31.1 | 5 |
| Streaming Voice-Agent Interaction Efficiency | BigBench Audio | NFE | 80.73 | 5 |
| Reasoning | BigBench Extra Hard | mean@4 | 14.3 | 4 |
| Contextual Reasoning | BigBenchHard | EM | 63.23 | 4 |
| Reasoning | BigBench-H SEM variant | Accuracy | 95.08 | 2 |
| Reasoning and Language Understanding | BigBench Emergent Suite (BBES) | Navigate | 67 | 2 |
| LLM Performance Prediction | BigBench (val) | Metric | - | 0 |