| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Reasoning | BBH | Accuracy 95.4 | 672 |
| Logical Reasoning | BBH | Accuracy 100 | 201 |
| General Reasoning | BBH | BBH General Reasoning Accuracy 94.6 | 98 |
| Reasoning | BBH (test) | Accuracy 62.06 | 67 |
| Reasoning | BBH 3-shot | BBH 3-shot Score 65.69 | 49 |
| Complex Reasoning | BBH (val) | Accuracy 65.81 | 42 |
| Complex Reasoning | BBH | Accuracy 85.93 | 40 |
| Instruction Following | BBH | Accuracy 67.1 | 40 |
| Question Answering | BBH | Accuracy 94.6 | 30 |
| Logical Reasoning | BBH (test) | Top@1 Accuracy 88.29 | 27 |
| Common-sense Reasoning | BBH | Accuracy 58.27 | 27 |
| Symbolic and Logical Reasoning | BBH | Accuracy 85.01 | 20 |
| Benchmark Compression (Coreset selection) | BBH (full) | rho 0.913 | 20 |
| General Reasoning | BBH | Accuracy 82.9 | 18 |
| Reasoning and Classification | BBH (Big-Bench Hard) (unseen) | BBH Temporal Sequences 98.8 | 17 |
| Complex Reasoning | BBH | Acc 83.03 | 16 |
| Reasoning | BBH | BBH Pass@1 69.92 | 16 |
| General Reasoning | BBH | Relative Cost 1 | 14 |
| Big-Bench Hard Reasoning | BBH | Accuracy 69.16 | 14 |
| General Reasoning | BBH | Accuracy (BBH) 73.2 | 12 |
| Hard Reasoning Tasks | BBH | BBH Accuracy (0-shot) 52.1 | 12 |
| Reasoning | BBH (unseen) | Total Average Score 42.38 | 12 |
| General Reasoning | BBH | Score 88.7 | 12 |
| Navigation Reasoning | BBH-Navigate (test) | Accuracy 98 | 11 |
| Reasoning | bbh-zh | Overall Score 87.52 | 10 |