| Dataset Name | SOTA Method | Metric | Trend | Updated | |
|---|---|---|---|---|---|
| Qwen3-1.7B Evaluation Suite (avg) | | Average Performance 58.64 | 38 | 27d ago | |
| AlpacaEval | GLM-4-Voice | AlpacaE 51.06 | 16 | 2mo ago | |
| JudgeBench (test) | Skywork-Reward-V2-Llama-3.1-8B-40M | Knowledge 79.9 | 16 | 3mo ago | |
| Curated Population (MATH-500, MMLU-Redux, SimpleQA) | gemini-2.5-pro | Accuracy 82.57 | 15 | 2mo ago | |
| HuggingFace Open LLM Leaderboard Old (test) | | GSM8K Score 92.08 | 14 | 23d ago | |
| Arena-Hard v2 | Qwen3-8B + CE-RM-4B | Score 18.2 | 14 | 3mo ago | |
| PandaLM | MILE-RefHumEval | Accuracy 78.98 | 12 | 3mo ago | |
| HealthBench (test) | | HealthBench Score (%) 62.6 | 11 | 2mo ago | |
| Shared (evaluation) | GrowLoop | Tie-aware Accuracy 78 | 10 | 5d ago | |
| AlpacaEval 2.0 | SpecEM | LC Win Rate 51.32 | 10 | 2mo ago | |
| Arena-Hard v0.1 | Qwen3-8B + CE-RM-4B | Arena-Hard Score 78.3 | 9 | 3mo ago | |
| Chinese FuseEval | SpecEM | Win Rate 56.77 | 7 | 2mo ago | |
| FuseEval English | SpecEM | Win Rate 55.46 | 7 | 2mo ago | |