| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| AlpacaEval | GLM-4-Voice | AlpacaE | 51.06 | 16 | 1mo ago |
| JudgeBench (test) | Skywork-Reward-V2-Llama-3.1-8B-40M | Knowledge | 79.9 | 16 | 1mo ago |
| Curated Population (MATH-500, MMLU-Redux, SimpleQA) | gemini-2.5-pro | Accuracy | 82.57 | 15 | 1mo ago |
| Arena-Hard v2 | Qwen3-8B + CE-RM-4B | Score | 18.2 | 14 | 1mo ago |
| PandaLM | MILE-RefHumEval | Accuracy | 78.98 | 12 | 1mo ago |
| HealthBench (test) | | HealthBench Score (%) | 62.6 | 11 | 23d ago |
| AlpacaEval 2.0 | SpecEM | LC Win Rate | 51.32 | 10 | 1mo ago |
| Arena-Hard v0.1 | Qwen3-8B + CE-RM-4B | Arena-Hard Score | 78.3 | 9 | 1mo ago |
| Chinese FuseEval | SpecEM | Win Rate | 56.77 | 7 | 1mo ago |
| FuseEval English | SpecEM | Win Rate | 55.46 | 7 | 1mo ago |