| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Logical Reasoning | HLE | Accuracy 0.7226 | 46 |
| HLE | HLE | Accuracy 67.1 | 25 |
| Long-horizon agentic task | HLE | Performance 60 | 24 |
| Humanities Question Answering | HLE | HLE Score 13.37 | 24 |
| Reasoning | HLE | Accuracy (HLE Reasoning) 25.3 | 23 |
| Knowledge-Intensive Reasoning | HLE | Avg Score 85 | 23 |
| General Reasoning | HLE | Accuracy 38.4 | 21 |
| Scientific Reasoning | HLE | pass@16 12 | 17 |
| High-Level Reasoning | HLE | Average Score 26.6 | 17 |
| Mathematical reasoning | HLE math | Accuracy 23.3 | 16 |
| Deep research | HLE | Accuracy 51 | 16 |
| Long-horizon agentic tasks | HLE Our Settings | Pass@1 44.4 | 15 |
| Deep Search | HLE text-only | Score 40.8 | 14 |
| Reasoning | HLE | Pass@1 18.03 | 14 |
| Deep Research | HLE text-only original (test) | Pass@1 32.9 | 13 |
| Multi-domain knowledge reasoning | HLE 500-question ablation | Success Rate (Last) 57.3 | 12 |
| General Deep Research Tool Use | HLE | Success Rate 42.9 | 12 |
| High-level Multimodal Reasoning | HLE-500 | Text Score 29.5 | 12 |
| Hard Reasoning and Language Evaluation | HLE | Accuracy 36.1 | 12 |
| Mathematical Reasoning | HLE Math-text | Pass@1 62.8 | 12 |
| Reasoning & General | HLE | Score 51.8 | 11 |
| Compositional Reasoning | HLE | Accuracy 23.1 | 11 |
| Long-horizon agentic tasks | HLE Full | Pass@1 45.8 | 10 |
| Reasoning & General | HLE Full | Score (%) 0.502 | 10 |
| Hard LLM Reasoning | HLE | Accuracy 15.5 | 10 |