| Dataset Name | SOTA Method | Metric | Trend | Last Updated |
|---|---|---|---|---|
| Utility Set (MMLU, BBH, TruthfulQA, TriviaQA, AlpacaEval) | ELUDe | MMLU 68.93 | 34 | 3d ago |
| Average of 10 tasks | T-SPIN | Overall Performance 45.02 | 12 | 1mo ago |
| OlmoBaseEval HeldOut (LBPP, BBH, MMLU Pro, etc.) | Nemo. 3 Nano | LBPP Score 33.7 | 10 | 12d ago |
| Arena-Hard V2.0 | RM-NLHF | Win Rate 7.03 | 9 | 1mo ago |
| WildBench | PUGC | WildBench Score 26.95 | 2 | 1mo ago |