| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| General Reasoning | BIG-Bench Hard | Accuracy 91.1 | 68 |
| Reasoning | BIG-Bench Hard (BBH) (test) | Average Accuracy 87.3 | 56 |
| General Reasoning | BIG-bench | Accuracy (General) 81.6 | 36 |
| Reasoning | Big-Bench Hard (BBH) | Accuracy 60.39 | 33 |
| Reasoning | BIG-Bench Hard (train) | Accuracy 91.9 | 28 |
| Word Sorting | BIG-bench Hard Word Sorting (test) | Test Accuracy 45 | 26 |
| Symbolic and Logical Reasoning | Big-Bench Hard (BBH) | Exact Match Performance 88.1 | 22 |
| Reasoning | Big-bench Hard (BBH) | Exact Match (EM) 53.53 | 20 |
| Multitask Language Understanding | BIG-bench-lite 24 tasks | Score 3,777 | 17 |
| Generation | Big-Bench Hard (test) | Exact Match 57.9 | 17 |
| Various NLP tasks (NLU and Reasoning) | BIG-bench (unseen) | Known Unknowns Score 86.96 | 15 |
| Date Understanding | BIG-bench Hard Date Understanding (test) | Test Accuracy 75.2 | 14 |
| Reasoning | BIG-Bench Hard MIX-14K | Accuracy 69.9 | 12 |
| General Language Understanding | BIG-bench Mimicked | Sports Score 99.7 | 11 |
| General Language Understanding | BIG-bench Original | Sports Score 99.4 | 11 |
| Reasoning | BIG Bench Audio Speech Modality | Accuracy 0.9341 | 10 |
| Reasoning | BIG-Bench Extra Hard | Score 37.8 | 10 |
| Multi-task Language Understanding | BIG-bench | Hindu Knowledge 80 | 10 |
| Task-solving | BIG-Bench Hard (BBH) (test) | Causal Judgement 68.3 | 10 |
| Complex Multi-step Reasoning | Big-Bench Hard | Hard Accuracy 85.7 | 9 |
| Language Reasoning | BBH (BIG-Bench Hard) | Object Counting Score 99.4 | 8 |
| Complex Reasoning | BIG-bench Hard | Orig Score 39.3 | 7 |
| Algorithmic Reasoning | Big-Bench Hard Word Sorting and Multi-step Arithmetic (test) | WS Accuracy 80 | 7 |
| Multiple Choice Question Answering | BIG-bench HHH Eval | Overall Score 87 | 7 |
| Causal judgment | Big-Bench Hard | Accuracy 69.5 | 6 |