| Dataset Name | SOTA Method | Metric | Trend | Updated |
|---|---|---|---|---|
| MMLU | | MMLU Accuracy 79.6 | 73 | 1mo ago |
| MTBench | REPBEND | MTBench Score 9.14 | 43 | 1mo ago |
| BBH, GSM8K, MMLU, TruthfulQA, HumanEval, MBPP | ADG | Average Score 26.77 | 30 | 5d ago |
| All Benchmarks Overall | UltraMix | Overall Average Score 52.04 | 29 | 5d ago |
| 8 capability benchmarks Aggregate | | Average Capability 67.14 | 26 | 1mo ago |
| MMLU-Pro OpenR1-Math Harder | | Accuracy 71.3 | 3 | 1mo ago |
| GPQA-diamond OpenR1-Math Harder Subset | | Accuracy 54 | 3 | 1mo ago |
| ARC-c OpenR1-Math Harder | RePO | Accuracy 70.6 | 3 | 1mo ago |