| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Creative Writing | WildBench | WildBench Score 83.9 | 45 |
| Instruction Following | WildBench (test) | Info Seek 58.6 | 27 |
| Open-ended generation | WildBench | WildBench 0.479 | 26 |
| Subjective Evaluation | WildBench | Score 0.8604 | 19 |
| General Instruction Following | WildBench | Score 92.6 | 19 |
| Instruction Following | WildBench | WB Score 63.18 | 18 |
| Open-ended Generation | WildBench (test) | WildBench Score 64.4 | 17 |
| Creative Writing | WildBench (test) | WildBench Score 64.4 | 15 |
| Real-world Query Evaluation | WildBench | WildBench Accuracy 71.5 | 14 |
| General Chat | WildBench | LLM Judge Score 68.16 | 12 |
| General chat | WildBench 2025 (test) | WB-Elo 1,062.4 | 12 |
| Open-ended reasoning | WildBench | Creative Score 57.05 | 5 |
| Open-ended text generation | WildBench | Score -1.7 | 4 |
| General Language Model Evaluation | WildBench | WildBench Score 26.95 | 2 |