| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|---|
| Hallucination evaluation | MMLU-Pro Law (test) | HALL% 12.1 | 21 |
| Academic Reasoning | MMLU-Pro | Pass@1 50.7 | 15 |
| General Knowledge Reasoning | MMLU-Pro (test) | Accuracy 37.72 | 10 |
| LLM Routing | MMLU Pro Social Sciences (Out-of-Domain) | LPM 59.2 | 7 |
| LLM Routing | MMLU Pro Humanities Out-of-Domain | LPM 51.74 | 7 |
| General Knowledge Task | MMLU-Pro (test) | Accuracy 56.3 | 6 |
| General Question Answering | MMLU-Pro (test) | Mean Accuracy 79.55 | 4 |
| General | MMLU-Pro (test) | Accuracy 83.76 | 4 |
| General Capability | MMLU-Pro OpenR1-Math Harder | Accuracy 71.3 | 3 |
| General question answering | MMLU-Pro (test) | Optimization Token Usage 595 | 3 |
| General | MMLU-Pro (test) | Optimization Token Usage (k) 778 | 3 |
| Multiple-choice Question Answering | MMLU-Pro Overall (test) | Mean Entropy (R1) 0.2456 | 3 |
| Language Understanding | MMLU-Pro v1 (test) | Accuracy 44.1 | 3 |
| Question Answering | MMLU-Pro Adversarial Setting (test) | Accuracy 98.9 | 2 |
| Scientific Reasoning | MMLU-Pro | Mean@1 75.2 | 2 |