| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Instruction Following | Arena Hard | Win Rate 98.11 | 263 |
| LLM-as-a-judge | ARENA | Accuracy 66.07 | 20 |
| Conversational versatility | Arena-Hard | Win Rate 61.16 | 20 |
| Textual understanding | Arena-Hard | Win Rate 67.1 | 17 |
| Image Editing | Arena Analysis March 26, 2026 (test) | Arena ELO 1,270 | 16 |
| Human-centric Quality Evaluation | Arena-Hard | Arena-Hard Score 28.8 | 15 |
| Open-ended Generation | Arena-Hard | Score 84.6 | 14 |
| Pluralistic Reward Model Learning | ARENA | Accuracy (ARENA) 60.56 | 10 |
| Technical problem-solving | Arena Hard | Win Rate 52.3 | 10 |
| Personalized Dialogue | Arena-Hard | Arena Win Rate 60 | 7 |
| Preference-based Generation | Arena CW | Score 36.5 | 6 |
| Preference-based Generation | Arena HP | Score 24.8 | 6 |
| Alignment | Arena-Hard | Score 48.1 | 5 |
| Reward Modeling | Arena100K (test) | Table MSE 0.2311 | 4 |
| Alignment | Arena-Hard | Hard Prompt Gemini Score 70.4 | 4 |
| Human Preference Evaluation | Arena (Phase 2) | Total Battles 200 | 3 |
| Human Preference Evaluation | Arena Creative Writing | Win Rate 23.4 | 3 |
| Preference Prediction | Arena | Count of Significant Features (S) 7 | 2 |