| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Reasoning | BBH | Accuracy: 95.4 | 726 |
| Logical Reasoning | BBH | Accuracy: 100 | 249 |
| General Reasoning | BBH | Accuracy: 93.2 | 190 |
| General Reasoning | BBH | BBH General Reasoning Accuracy: 94.6 | 103 |
| Reasoning | BBH (test) | Accuracy: 73.9 | 94 |
| Complex Reasoning | BBH | Accuracy: 90.5 | 85 |
| Instruction Induction | BBH Induct | Accuracy: 91.3 | 80 |
| Reasoning | BBH 3-shot | BBH 3-shot Score: 65.69 | 49 |
| Reasoning | BBH | BBH Pass@1: 83.69 | 49 |
| Complex Reasoning | BBH (val) | Accuracy: 65.81 | 42 |
| Causal Reasoning | BBH Causal Judgement | Accuracy (BBH Causal Judgement): 78 | 40 |
| Instruction Following | BBH | Accuracy: 67.1 | 40 |
| Reasoning | BBH | BBH Score: 84.5 | 39 |
| Reasoning | BBH | Score: 81.1 | 36 |
| Spatial Reasoning | BBH Navigate | Accuracy@1: 98 | 33 |
| Question Answering | BBH | Accuracy: 94.6 | 33 |
| Logical Reasoning | BBH (test) | Top@1 Accuracy: 88.29 | 29 |
| Deductive Reasoning | BBH Ded. | Accuracy: 92.5 | 28 |
| Common-sense Reasoning | BBH | Accuracy: 58.27 | 27 |
| Instruction Tuning | BBH | Accuracy (BBH): 66.2 | 24 |
| Logical Deduction | BBH Logical Deduction (Seven Objects) (test) | Accuracy: 55.2 | 22 |
| Common Sense Reasoning | BBH Sports Understanding | Accuracy (BBH Sports): 88 | 21 |
| Symbolic and Logical Reasoning | BBH | Accuracy: 85.01 | 20 |
| Benchmark Compression (Coreset selection) | BBH (full) | rho: 0.913 | 20 |
| Tracking Shuffled Objects (Seven Objects) | BBH (test) | Accuracy: 92.8 | 20 |