| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Reverse Chain-of-Thought Generation | ArenaHard | Score 72 | 20 |
| Instruction Following Evaluation | ArenaHard v1 | ArenaHardv1 Score 38 | 14 |
| Creative Writing | ArenaHard creative writing v2.0 | WR Score 29 | 13 |
| Instruction Following | ArenaHard Creative Writing 2.0 | Win Rate 61.9 | 12 |
| Instruction Following | ArenaHard Hard Prompts 2.0 | Win Rate 32.7 | 12 |
| General Chat | ArenaHard v2.0 | Win Rate 52 | 12 |
| General Chat | ArenaHard v1.0 | Win Rate 82.75 | 12 |
| General Reasoning and Creative Writing | ArenaHard v2 | Hard Prompt Score 15.5 | 8 |
| Alignment | ArenaHard | pass@1 95.7 | 7 |
| Human Preference Alignment | ArenaHard V2 | Avg@3 Score 60 | 6 |
| Alignment & Instruction Following | ArenaHard Hard Prompt v2 | Pass@1 88.2 | 4 |
| Chatbot Evaluation | ArenaHard v2 | Hard Prompt Accuracy 14 | 4 |
| Alignment & Instruction Following | ArenaHard Creative Writing v2 | Pass@1 78.7 | 3 |
| Alignment & Instruction Following | ArenaHard Avg. v2 | Pass@1 83.5 | 3 |