| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| LLM Alignment Evaluation | Arena-Hard | Win Rate 42.7 | 67 |
| Language Model Alignment Evaluation | Arena-Hard v0.1 | Win Rate (%) 35.2 | 36 |
| General Instruction Following | Arena-Hard | Score 22.1 | 35 |
| Creative Writing | Arena-Hard Creative Writing v2 | Score 90.8 | 25 |
| General Instruction Following | Arena-Hard v2 | Score 85.9 | 23 |
| Instruction Following | Arena Hard v0.1 | Score 37.9 | 16 |
| LLM Evaluation | Arena-Hard v2 | Score 18.2 | 14 |
| Open-domain task | Arena-Hard (test) | Error 12.61 | 12 |
| Open-domain task | Arena-Hard | Error (%) 5.17 | 12 |
| Preference Modeling | Arena-Hard V2 | Win Rate 73.2 | 9 |
| General Language Model Evaluation | Arena-Hard V2.0 | Win Rate 7.03 | 9 |
| LLM Evaluation | Arena-Hard v0.1 | Arena-Hard Score 78.3 | 9 |
| Chat Preference | Arena Hard v2 | Score 79.9 | 8 |
| General Writing | Arena-Hard Creative Writing | Score 93.6 | 6 |
| General Writing | Arena-Hard Prompt | Score 72.6 | 6 |
| General Chat | Arena-Hard Style-Controlled | Win-rate 46.1 | 5 |
| General Chat | Arena-Hard Vanilla | Win Rate 0.492 | 5 |
| Reward Model Evaluation | Arena-Hard RU | Best@8 Score 92.69 | 5 |
| Open-ended text generation | Arena-hard Creative-Writing | Pairwise Win Rate 80.2 | 4 |
| Open-ended text generation | Arena-hard Hard-Prompt | Pairwise Win Rate 58.5 | 4 |
| Creative Writing | Arena Hard | Win Rate 63.5 | 4 |
| Human Preference Evaluation | Arena-Hard v0.1 | Win Rate 56.7 | 3 |