WildBench

Benchmarks

Task Name	Dataset Name	SOTA Result
Creative Writing	WildBench	WildBench Score 83.9	49
Writing	WildBench (test)	Score 0.644	32
Instruction Following	WildBench (test)	Info Seek 58.6	27
Open-ended generation	WildBench	WildBench 0.479	26
Assistant Response Generation	WildBench v2	Win Rate 68.4	20
Subjective Evaluation	WildBench	Score 0.8604	19
General Instruction Following	WildBench	Score 92.6	19
Instruction Following	WildBench	WB Score 63.18	18
Open-ended Generation	WildBench (test)	WildBench Score 64.4	17
Creative Writing	WildBench (test)	WildBench Score 64.4	15
Real-world Query Evaluation	WildBench	WildBench Accuracy 71.5	14
General Chat	WildBench	LLM Judge Score 68.16	12
General chat	WildBench 2025 (test)	WB-Elo 1,062.4	12
Instruction Following	WildBench 1.0 (test)	WB-Score 37.98	8
LLM evaluation	WildBench v2	Quality Score 64.9	6
Chatbot Evaluation	WildBench	Overall Score 71.64	6
Open-ended reasoning	WildBench	Creative Score 57.05	5
Open-ended text generation	WildBench	Score -1.7	4
Open-ended instruction following	WildBench v2	Win Rate 58.1	3
General Language Model Evaluation	WildBench	WildBench Score 26.95	2
