| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Commonsense Reasoning | BigBenchHard | Accuracy 71.7 | 18 |
| Discriminative tasks | BigBench 13 tasks (val) | Accuracy 58.7 | 17 |
| Audio-based Reasoning | BigBench Audio | Accuracy 73.77 | 8 |
| Language Modeling and Reasoning | BigBench (Lamb, SQuAD, CoQA, BBH, LSAT, LangID) | Avg Score 24 | 8 |
| Instruction Induction | BigBench Instruction Induction (BBII) (test) | BBII Text Classification Score 60.14 | 6 |
| Reasoning | BigBenchHard | Accuracy (BigBenchHard) 82.4 | 5 |
| Multi-task Reasoning | BigBench Hard | Score 31.1 | 5 |
| Streaming Voice-Agent Interaction Efficiency | BigBench Audio | NFE 80.73 | 5 |
| Reasoning | BigBench Extra Hard | mean@4 14.3 | 4 |
| Contextual Reasoning | BigBenchHard | EM 63.23 | 4 |
| Reasoning and Language Understanding | BigBench Emergent Suite (BBES) | Navigate 67 | 2 |
| LLM Performance Prediction | BigBench (val) | Metric - | 0 |