| Dataset Name | SOTA Method | Metric | Trend | | |
|---|---|---|---|---|---|
| MMLU | | MMLU Accuracy 79.6 | 74 | 21d ago | |
| Aggregate (GPQA-D, GSM8K, HumanEval, MATH-500, MBPP, MMLU-Pro) | FOREVER | Average Accuracy 75.9 | 66 | 21d ago | |
| MTBench | REPBEND | MTBench Score 9.14 | 43 | 3mo ago | |
| BBH, GSM8K, MMLU, TruthfulQA, HumanEval, MBPP | ADG | Average Score 26.77 | 30 | 1mo ago | |
| All Benchmarks Overall | UltraMix | Overall Average Score 52.04 | 29 | 1mo ago | |
| 8 capability benchmarks Aggregate | | Average Capability 67.14 | 26 | 3mo ago | |
| OLMES benchmarks | | Average Score 51.4 | 9 | 26d ago | |
| Aggregated Suite 7-metric average (test) | G-Zero | Average Score 43.9 | 8 | 21d ago | |
| GPQA Diamond | PODS | Accuracy 37.4 | 4 | 19d ago | |
| MMLU-Pro OpenR1-Math Harder | | Accuracy 71.3 | 3 | 3mo ago | |
| GPQA-diamond OpenR1-Math Harder Subset | | Accuracy 54 | 3 | 3mo ago | |
| ARC-c OpenR1-Math Harder | RePO | Accuracy 70.6 | 3 | 3mo ago | |