| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| General Reasoning | BBEH | Accuracy | 78.8 | 64 |
| Logical Reasoning | BBEH | Accuracy | 58.9 | 27 |
| Reasoning | BBEH | pass@1 | 15.7 | 23 |
| Causal Reasoning | BBEH | Accuracy (Causal Reasoning) | 55.2 | 14 |
| Reasoning | BBEH (test) | Accuracy | 34.5 | 14 |
| LLM Routing | BBEH (val) | Top-1 Acc | 66.4 | 14 |
| LLM Routing | BBEH | Top-1 Accuracy | 66.4 | 14 |
| Reasoning | BBEH mini | Pass@1 | 14.8 | 13 |
| Reasoning | BBEH | Accuracy | 81.2 | 12 |
| Algorithmic Reasoning | BBEH Mini | Accuracy | 17.8 | 11 |
| Reasoning | BBEH | Accuracy | 75.8 | 7 |
| Adding Mistake | BBEH | AOC | 67.2 | 7 |
| Truncated CoT Answering | BBEH | AOC | 0.665 | 7 |
| Logical Reasoning | BBEH mini | Accuracy | 17 | 6 |
| Web of Lies | BBEH Web of Lies | Accuracy | 90.12 | 3 |
| Dyck Languages | BBEH Dyck Languages | Accuracy | 91.25 | 3 |
| Disambiguation QA | BBEH Disambiguation QA | Accuracy | 65.41 | 3 |