| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Reasoning | BBH | Accuracy 95.4 | 507 |
| Logical Reasoning | BBH | Accuracy 100 | 93 |
| General Reasoning | BBH | BBH General Reasoning Accuracy 88.7 | 43 |
| Complex Reasoning | BBH (val) | Accuracy 65.81 | 42 |
| Complex Reasoning | BBH | Accuracy 85.93 | 40 |
| Instruction Following | BBH | Accuracy 67.1 | 40 |
| Reasoning | BBH (test) | Accuracy 59.5 | 40 |
| Question Answering | BBH | Accuracy 94.6 | 30 |
| Logical Reasoning | BBH (test) | Top@1 Accuracy 88.29 | 27 |
| Common-sense Reasoning | BBH | Accuracy 58.27 | 27 |
| Benchmark Compression (Coreset selection) | BBH (full) | rho 0.913 | 20 |
| Reasoning | BBH | BBH Pass@1 69.92 | 16 |
| Hard Reasoning Tasks | BBH | BBH Accuracy (0-shot) 52.1 | 12 |
| Reasoning | BBH (unseen) | Total Average Score 42.38 | 12 |
| Navigation Reasoning | BBH-Navigate (test) | Accuracy 98 | 11 |
| Reasoning | bbh-zh | Overall Score 87.52 | 10 |
| Helpfulness, Honesty, and Harmlessness Alignment Evaluation | BBH HHH | Harmlessness Score 95 | 10 |
| Comprehensive cognitive reasoning | BBH | BBH Comprehensive Reasoning Score 40.65 | 8 |
| Reasoning and Classification | BBH (Big-Bench Hard) (unseen) | BBH Boolean Expressions 88.4 | 8 |
| General Reasoning | BBH | Pass@1 55.59 | 8 |
| Logical reasoning | BBH multiple-choice (first 1,000 samples) | Exact Match Accuracy 86.2 | 7 |
| Logical Deduction | BBH Logical Deduction (Seven Objects) (test) | Accuracy 47.5 | 6 |
| Navigation | BBH Navigation (test) | Accuracy 83.1 | 6 |
| Complex reasoning | BBH | BBH Solution Rate 67.4 | 6 |
| STEM | BBH | Accuracy 70.8 | 6 |