| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Instruction Following | Arena Hard | Win Rate | 98.11 | 103 |
| LLM-as-a-judge | ARENA | Accuracy | 66.07 | 20 |
| Conversational versatility | Arena-Hard | Win Rate | 61.16 | 20 |
| Image Editing | Arena Analysis March 26, 2026 (test) | Arena ELO | 1,270 | 16 |
| Open-ended Generation | Arena-Hard | Score | 84.6 | 14 |
| Pluralistic Reward Model Learning | ARENA | Accuracy (ARENA) | 60.56 | 10 |
| Technical problem-solving | Arena Hard | Win Rate | 52.3 | 10 |
| Alignment | Arena-Hard | Score | 48.1 | 5 |
| Alignment | Arena-Hard | Hard Prompt Gemini Score | 70.4 | 4 |
| Human Preference Evaluation | Arena Creative Writing | Win Rate | 23.4 | 3 |
| Preference Prediction | Arena | Count of Significant Features (S) | 7 | 2 |