| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| General Reasoning | BIG-Bench Hard | Accuracy | 91.1 | 68 |
| General Reasoning | BIG-bench | Accuracy @ t1 | 74.6 | 29 |
| Reasoning | BIG-Bench Hard (BBH) (test) | Average Accuracy | 79.4 | 28 |
| Symbolic and Logical Reasoning | Big-Bench Hard (BBH) | Exact Match Performance | 88.1 | 22 |
| Reasoning | Big-bench Hard (BBH) | Exact Match (EM) | 53.53 | 20 |
| Multitask Language Understanding | BIG-bench-lite 24 tasks | Score | 3,777 | 17 |
| Generation | Big-Bench Hard (test) | Exact Match | 57.9 | 17 |
| Various NLP tasks (NLU and Reasoning) | BIG-bench (unseen) | Known Unknowns Score | 86.96 | 15 |
| Date Understanding | BIG-bench Hard Date Understanding (test) | Test Accuracy | 75.2 | 14 |
| General Language Understanding | BIG-bench Mimicked | Sports Score | 99.7 | 11 |
| General Language Understanding | BIG-bench Original | Sports Score | 99.4 | 11 |
| Reasoning | BIG-Bench Extra Hard | Score | 37.8 | 10 |
| Multi-task Language Understanding | BIG-bench | Hindu Knowledge | 80 | 10 |
| Complex Reasoning | BIG-bench Hard | Orig Score | 39.3 | 7 |
| Algorithmic Reasoning | Big-Bench Hard Word Sorting and Multi-step Arithmetic (test) | WS Accuracy | 80 | 7 |
| Multiple Choice Question Answering | BIG-bench HHH Eval | Overall Score | 87 | 7 |
| Spoken Dialogue | Big Bench Audio (test) | S2T Accuracy | 72.9 | 6 |
| LLM Workflow Optimization | Big-Bench Hard (BBH) (test) | BBH Overall Accuracy | 78.6 | 6 |
| Task-solving | BIG-Bench Hard (BBH) (test) | Boolean Expressions | 84 | 6 |
| Natural Language Understanding | BIG-Bench Hard (BBH) | Accuracy | 42.1 | 5 |
| Diverse reasoning tasks | BIG-bench Hard (BBH) | Boolean Expressions | 83.2 | 5 |
| Reasoning | BIG-Bench Hard (train) | Causal Judgment | 67.7 | 5 |
| Multi-task Language Understanding | BIG-bench | Anachronisms | 49.1 | 5 |
| General Language Capability | BIG-bench 57 Task | Accuracy (Weighted) | 48.7 | 5 |
| Movie Recommendation | BIG-bench Hard Movie Recommendation (test) | Test Accuracy | 79 | 4 |