Chatbot Arena

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Human Preference Prediction | Chatbot Arena latest (test) | Accuracy: 72.18 | 51 |
| Personalized Reward Modeling | Chatbot Arena Personalized | Accuracy: 75.92 | 42 |
| Conversational AI Evaluation | Chatbot Arena | Rank: 1 | 40 |
| Pairwise LLM Judging | Chatbot Arena | Coverage: 100 | 16 |
| LLM as a Judge | Chatbot Arena (test) | Accuracy: 68.13 | 14 |
| Judge Agreement | Chatbot Arena Random = 50% (S2) | Agreement: 96 | 10 |
| Judge Agreement | Chatbot Arena Random = 33% (S1) | Agreement Rate: 72 | 10 |
| Binary/Pairwise Classification | Chatbot-Arena | Accuracy: 58 | 9 |
| Alignment with Human Preferences | Chatbot Arena English-only | Spearman Correlation: 91.67 | 9 |
| Correlation analysis with human preferences | Chatbot Arena 15 LLMs after extension | Spearman Correlation: 0.9214 | 7 |
| Pairwise comparison evaluation | Chatbot Arena | Accuracy: 60.2 | 5 |
| Open-ended text generation | Chatbot Arena inspired qualitative prompts (val) | ELO: 1,150.78 | 4 |