Chatbot Arena

Benchmarks

Task Name	Dataset Name	SOTA Result
Human Preference Prediction	Chatbot Arena latest (test)	Accuracy 72.18	51
Personalized Reward Modeling	Chatbot Arena Personalized	Accuracy 75.92	42
Conversational AI Evaluation	Chatbot Arena	Rank 1	40
LLM Judgement Confidence Estimation	Chatbot Arena (test)	RK 0.3418	16
Pairwise LLM Judging	Chatbot Arena	Coverage 100	16
LLM as a Judge	Chatbot Arena (test)	Accuracy 68.13	14
LLM-as-a-judge	Chatbot Arena	Coverage 94.3	12
Confidence Estimation	Chatbot Arena	Rank Correlation (RK) 0.3524	11
Judge Agreement	Chatbot Arena Random = 50% (S2)	Agreement 96	10
Judge Agreement	Chatbot Arena Random = 33% (S1)	Agreement Rate 72	10
Binary/Pairwise Classification	Chatbot-Arena	Accuracy 58	9
Alignment with Human Preferences	Chatbot Arena English-only	Spearman Correlation 91.67	9
Correlation analysis with human preferences	Chatbot Arena 15 LLMs after extension	Spearman Correlation 0.9214	7
Pairwise comparison evaluation	Chatbot Arena	Accuracy 60.2	5
Open-ended text generation	Chatbot Arena inspired qualitative prompts (val)	ELO 1,150.78	4