
Chatbot Arena

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Human Preference Prediction | Chatbot Arena latest (test) | Accuracy 72.18 | 51 |
| Personalized Reward Modeling | Chatbot Arena Personalized | Accuracy 75.92 | 42 |
| Conversational AI Evaluation | Chatbot Arena | Rank 1 | 40 |
| Pairwise LLM Judging | Chatbot Arena | Coverage 100 | 16 |
| LLM as a Judge | Chatbot Arena (test) | Accuracy 68.13 | 14 |
| Judge Agreement | Chatbot Arena Random = 50% (S2) | Agreement 96 | 10 |
| Judge Agreement | Chatbot Arena Random = 33% (S1) | Agreement Rate 72 | 10 |
| Alignment with Human Preferences | Chatbot Arena English-only | Spearman Correlation 91.67 | 9 |
| Correlation analysis with human preferences | Chatbot Arena 15 LLMs after extension | Spearman Correlation 0.9214 | 7 |
| Open-ended text generation | Chatbot Arena inspired qualitative prompts (val) | ELO 1,150.78 | 4 |