| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Expert-Level Reasoning | HLE (Humanity's Last Exam) text-only subset (val) | Inference Accuracy 52.2 | 13 |
| Performance Estimation | HLE (Humanity's Last Exam) 2% subset | MAE 2.9 | 3 |
| Performance Estimation | HLE (Humanity's Last Exam) 1% subset | MAE 3.5 | 3 |
| Performance Estimation | HLE (Humanity's Last Exam) 0.5% subset | MAE 5.6 | 3 |