| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Human Preference Prediction | Chatbot Arena latest (test) | Accuracy: 72.18 | 51 |
| Personalized Reward Modeling | Chatbot Arena Personalized | Accuracy: 75.92 | 42 |
| Conversational AI Evaluation | Chatbot Arena | Rank: 1 | 40 |
| Pairwise LLM Judging | Chatbot Arena | Coverage: 100 | 16 |
| LLM as a Judge | Chatbot Arena (test) | Accuracy: 68.13 | 14 |
| Judge Agreement | Chatbot Arena Random = 50% (S2) | Agreement: 96 | 10 |
| Judge Agreement | Chatbot Arena Random = 33% (S1) | Agreement Rate: 72 | 10 |
| Alignment with Human Preferences | Chatbot Arena English-only | Spearman Correlation: 91.67 | 9 |
| Correlation analysis with human preferences | Chatbot Arena 15 LLMs after extension | Spearman Correlation: 0.9214 | 7 |
| Open-ended text generation | Chatbot Arena inspired qualitative prompts (val) | ELO: 1,150.78 | 4 |