| Task Name | Dataset Name | SOTA Result | Trend | |
|---|---|---|---|---|
| Knowledge | MMLU-Pro 5-shot | Knowledge Score (5-shot): 44.65 | 37 | |
| Health | MMLU-Pro Health (FR) X (test) | Accuracy: 66.08 | 35 | |
| Hallucination evaluation | MMLU-Pro Law (test) | HALL%: 12.1 | 21 | |
| Mathematical Reasoning | MMLU-Pro Math | Accuracy: 77.6 | 18 | |
| Academic Reasoning | MMLU-Pro | Pass@1: 50.7 | 15 | |
| Medical Question Answering | MMLU-Pro Health | Accuracy: 60.76 | 12 | |
| General Knowledge Reasoning | MMLU-Pro (test) | Accuracy: 37.72 | 10 | |
| Multiple Choice Question Answering | MMLU-Pro Psychology | Calibration Threshold (q-hat): 0.983 | 8 | |
| Multiple Choice Question Answering | MMLU-Pro Health | Calibration Threshold (q-hat): 99.2 | 8 | |
| Multiple Choice Question Answering | MMLU-Pro Chemistry | Calibration Threshold (q-hat): 0.99 | 8 | |
| Multiple Choice Question Answering | MMLU-Pro Law | Calibration Threshold (q-hat): 0.997 | 8 | |
| Multiple-choice Question Answering | MMLU-Pro Zipf 1.4 | Accuracy: 87.4 | 7 | |
| Multiple-choice Question Answering | MMLU-Pro Zipf 1.1 | Accuracy: 81.8 | 7 | |
| LLM Routing | MMLU Pro Social Sciences (Out-of-Domain) | LPM: 59.2 | 7 | |
| LLM Routing | MMLU Pro Humanities Out-of-Domain | LPM: 51.74 | 7 | |
| General Knowledge Task | MMLU-Pro (test) | Accuracy: 56.3 | 6 | |
| Science Question Answering | MMLU-Pro OOD | Accuracy (MMLU-Pro OOD Science): 54.7 | 5 | |
| Query Routing | MMLU-Pro OOD | CPT Score (85%): 74.4 | 4 | |
| Query Routing | MMLU-Pro OOD | CPT (80%): 66.14 | 4 | |
| General Question Answering | MMLU-Pro (test) | Mean Accuracy: 79.55 | 4 | |
| General | MMLU-Pro (test) | Accuracy: 83.76 | 4 | |
| Multiple-choice Question Answering | MMLU-Pro (test) | Accuracy: 89.6 | 3 | |
| Multiple Choice Question Answering | MMLU-Pro law 1.0 (test) | Accuracy: 71.05 | 3 | |
| General Capability | MMLU-Pro OpenR1-Math Harder | Accuracy: 71.3 | 3 | |
| General question answering | MMLU-Pro (test) | Optimization Token Usage: 595 | 3 | |