| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| ArenaHard v2.0 | GOLF | Win Rate | 52 | 12 | 2mo ago |
| ArenaHard v1.0 | | Win Rate | 82.75 | 12 | 2mo ago |
| WildBench | GOLF | LLM Judge Score | 68.16 | 12 | 2mo ago |
| WildBench 2025 (test) | SR-GRPO | WB-Elo | 1,062.4 | 12 | 3mo ago |
| AE2 LC | Qwen3-8B (Non-thinking) | Win Rate | 61.7 | 6 | 16d ago |
| Arena Hard v0.1 | PMLE | Win Rate | 46.5 | 5 | 14d ago |
| Arena-Hard Style-Controlled | PROSPER | Win-rate | 46.1 | 5 | 3mo ago |
| Arena-Hard Vanilla | PROSPER | Win Rate | 0.492 | 5 | 3mo ago |