
Chatbot Arena

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Human Preference Prediction | Chatbot Arena latest (test) | Accuracy: 72.18 | 51 |
| Personalized Reward Modeling | Chatbot Arena Personalized | Accuracy: 75.92 | 42 |
| Conversational AI Evaluation | Chatbot Arena | Rank: 1 | 40 |
| LLM Judgement Confidence Estimation | Chatbot Arena (test) | RK: 0.3418 | 16 |
| Pairwise LLM Judging | Chatbot Arena | Coverage: 100 | 16 |
| LLM as a Judge | Chatbot Arena (test) | Accuracy: 68.13 | 14 |
| LLM-as-a-judge | Chatbot Arena | Coverage: 94.3 | 12 |
| Confidence Estimation | Chatbot Arena | Rank Correlation (RK): 0.3524 | 11 |
| Judge Agreement | Chatbot Arena Random = 50% (S2) | Agreement: 96 | 10 |
| Judge Agreement | Chatbot Arena Random = 33% (S1) | Agreement Rate: 72 | 10 |
| Binary/Pairwise Classification | Chatbot-Arena | Accuracy: 58 | 9 |
| Alignment with Human Preferences | Chatbot Arena English-only | Spearman Correlation: 91.67 | 9 |
| Correlation analysis with human preferences | Chatbot Arena 15 LLMs after extension | Spearman Correlation: 0.9214 | 7 |
| Pairwise comparison evaluation | Chatbot Arena | Accuracy: 60.2 | 5 |
| Open-ended text generation | Chatbot Arena inspired qualitative prompts (val) | ELO: 1,150.78 | 4 |