| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| LLM Alignment Evaluation | Arena-Hard | Win Rate | 42.7 | 73 |
| General Instruction Following | Arena-Hard | Score | 22.1 | 46 |
| General Instruction Following | Arena-Hard v2 | Score | 85.9 | 37 |
| Language Model Alignment Evaluation | Arena-Hard v0.1 | Win Rate (%) | 35.2 | 36 |
| LLM Alignment Evaluation | Arena-Hard v0.1 | Win Rate | 50 | 31 |
| Creative Writing | Arena-Hard Creative Writing v2 | Score | 90.8 | 25 |
| Instruction Following | Arena-Hard Vanilla | Instruction Following Score | 57.5 | 19 |
| Creative Writing | Arena Hard | Win Rate | 63.5 | 18 |
| Instruction Following | Arena-Hard Style-Con | Score | 57.7 | 17 |
| General Chat Evaluation | Arena-Hard | Win Rate | 84 | 16 |
| Instruction Following | Arena Hard v0.1 | Score | 37.9 | 16 |
| Downstream Policy Performance | Arena-Hard v2.0 | Win Rate | 33.9 | 14 |
| LLM Evaluation | Arena-Hard v2 | Score | 18.2 | 14 |
| Complex reasoning | Arena-Hard 2.0 (test) | Overall Accuracy | 52.9 | 12 |
| Open-domain task | Arena-Hard (test) | Error | 12.61 | 12 |
| Open-domain task | Arena-Hard | Error (%) | 5.17 | 12 |
| Conversational Skill Evaluation | Arena-Hard | Win Rate (%) | 32.6 | 11 |
| Chat Preference | Arena Hard v2 | Score | 79.9 | 10 |
| Chat Quality Evaluation | Arena-Hard vs gpt-4-0314 (test) | Win Rate | 57.6 | 9 |
| Preference Modeling | Arena-Hard V2 | Win Rate | 73.2 | 9 |
| General Language Model Evaluation | Arena-Hard V2.0 | Win Rate | 7.03 | 9 |
| LLM Evaluation | Arena-Hard v0.1 | Arena-Hard Score | 78.3 | 9 |
| Open-ended Generation | Arena-Hard v2.0 | Score | 47.8 | 8 |
| Instruction following | Arena-Hard v2 (test) | AH2 Score | 1.3 | 8 |
| LLM Chat Evaluation | Arena-Hard v0.1 (test) | Win Rate | 40.1 | 6 |