| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Reasoning | Humanity's Last Exam | Accuracy: 84.61 | 46 |
| Question Answering | Humanity's Last Exam | Pass@1: 51.7 | 16 |
| Expert-level Reasoning | Humanity's Last Exam 2,158 text-only | Avg@3 Score: 54.2 | 15 |
| Expert-Level Question Answering | Humanity's Last Exam | Accuracy: 40.9 | 14 |
| Complex Reasoning | Humanity's Last Exam (HLE) | Pass@1 Score: 18.4 | 13 |
| Question Answering | Humanity's Last Exam (HLE) MCQ | Accuracy: 19.9 | 6 |
| Long Context Evaluation | Humanity's Last Exam AA-LCR | Accuracy: 54.3 | 6 |
| World Knowledge | HUMANITY’S LAST EXAM text-only | Score: 11.1 | 4 |