| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Knowledge-Intensive Reasoning | HLE | Avg Score: 85 | 75 |
| Math Reasoning | HLE Math-100 | Pass@1: 35.84 | 68 |
| Reasoning | HLE | Accuracy (HLE Reasoning): 40.8 | 63 |
| Logical Reasoning | HLE | Accuracy: 0.7226 | 62 |
| Long-horizon agentic task | HLE | Performance: 60 | 41 |
| Reasoning | HLE | Score: 64.7 | 39 |
| Multimodal Reasoning | HLE | Accuracy: 48.8 | 33 |
| Scientific Reasoning | HLE (test) | Pass@1: 49 | 25 |
| High-Level Expert Knowledge Evaluation | HLE Gold 149 | Accuracy (Bio): 80.5 | 25 |
| HLE | HLE | Accuracy: 67.1 | 25 |
| Humanities Question Answering | HLE | HLE Score: 13.37 | 24 |
| General Reasoning | HLE | Accuracy: 38.4 | 21 |
| General and STEM reasoning | HLE | Pass@1: 8.12 | 20 |
| Reasoning | HLE | Head-to-head Win %: 100 | 20 |
| Scientific Reasoning | HLE | pass@16: 12 | 17 |
| High-Level Reasoning | HLE | Average Score: 26.6 | 17 |
| Reasoning | HLE | Accuracy: 50.2 | 16 |
| Mathematical reasoning | HLE math | Accuracy: 23.3 | 16 |
| Deep research | HLE | Accuracy: 51 | 16 |
| Long-horizon agentic tasks | HLE Our Settings | Pass@1: 44.4 | 15 |
| Mathematical Reasoning | HLE decontaminated | Accuracy: 8.4 | 14 |
| Reasoning | HLE OOD | Accuracy: 38.6 | 14 |
| Reasoning | HLE (test) | Accuracy: 26 | 14 |
| Deep Search | HLE text-only | Score: 40.8 | 14 |
| Reasoning | HLE | Pass@1: 18.03 | 14 |