| Task Name | Dataset Name | SOTA Result | Trend | |
|---|---|---|---|---|
| Humanities Question Answering | HLE | HLE Score: 13.37 | 24 | |
| Knowledge-Intensive Reasoning | HLE | Avg Score: 85 | 23 | |
| Logical Reasoning | HLE | Accuracy: 0.305 | 21 | |
| General Reasoning | HLE | Accuracy: 38.4 | 21 | |
| Scientific Reasoning | HLE | pass@1612 | 17 | |
| High-Level Reasoning | HLE | Average Score: 26.6 | 17 | |
| Deep research | HLE | Accuracy: 51 | 16 | |
| Deep Search | HLE text-only | Score: 40.8 | 14 | |
| Reasoning | HLE | Pass@1: 18.03 | 14 | |
| Deep Research | HLE text-only original (test) | Pass@1: 32.9 | 13 | |
| Hard Reasoning and Language Evaluation | HLE | Accuracy: 36.1 | 12 | |
| Mathematical Reasoning | HLE Math-text | Pass@1: 62.8 | 12 | |
| Reasoning & General | HLE | Score: 51.8 | 11 | |
| Compositional Reasoning | HLE | Accuracy: 23.1 | 11 | |
| Reasoning & General | HLE Full | Score (%): 0.502 | 10 | |
| Hard LLM Reasoning | HLE | Accuracy: 15.5 | 10 | |
| Question Answering | HLE | Performance Score: 17.6 | 8 | |
| Search | HLE text | Score: 45.8 | 7 | |
| Scientific Reasoning & QA | HLE | Accuracy: 3.61 | 7 | |
| Reasoning | HLE | Score: 17.9 | 7 | |
| Hard Reasoning | HLE | Pass@1: 37.7 | 7 | |
| Multi-domain Knowledge and Reasoning | HLE (Humanity’s Last Exam) (official) | Exact Match: 42 | 7 | |
| Confidence Calibration | HLE (test) | ECE: 0.031 | 7 | |
| Agentic Reasoning | HLE | Overall Score: 41.6 | 7 | |
| Scientific Reasoning | HLE Text-only | Accuracy: 13.7 | 6 | |