SOTA Human Evaluation benchmarks and papers with code

Benchmarks

Dataset Name	SOTA Method	Metric	Value	Count	Last Updated
ESConv	KEMI	Win Rate	70	36	3mo ago
VOVILLE	PAVE	Score	6.18	9	2mo ago
NeurIPS ICML ICLR Proposals 2025 (test)	Stepwise CoT	Wins	31	6	3mo ago
CulturalVQA OOD (test)	MMBoundary	Faithfulness	7.66	6	4mo ago
ScienceVQA (test)	MMBoundary	Faithfulness Score	8.35	6	4mo ago
A-OKVQA (test)	MMBoundary	Faithfulness Score	7.83	6	4mo ago
Skywork (test)	AAD	Elo Rating	1,610.1	5	1mo ago
DeepResearch Bench 20 reports (sampled)	PTAH	Readability (Win/Tie Rate)	95	5	1mo ago
UltraFeedback 50 sampled questions	OTPO	Win Rate (Expert 1)	62	5	4mo ago
CoVOMIX2-DIALOGUE-20S and CoVOMIX2-DIALOGUE-WILDREF mix	SCENA	Win Rate (SCENA Preferred)	84.6	4	1mo ago
RCC-PVD (evaluation)	Ranking	Rank Preference Rate	61	4	4mo ago
Human Evaluation	Ann Brown	Trustworthiness	0.86	4	4mo ago
Human study blinded triplet comparison		Consistency Rank	1.5	3	22d ago
Human Evaluation Rapport and UX	IPA	Rapport Score R1	3.88	3	1mo ago
Management subset	MENTOR	Win Rate	93	3	1mo ago
Finance	MENTOR	Win Rate	97	3	1mo ago
Education subset	MENTOR	Win Rate	85	3	1mo ago
Human Evaluation Evil Players	GRAIL Agent	Contributed Success	3.78	3	3mo ago
MathQA	Ours	Accuracy	89.2	3	4mo ago
50 randomly selected model responses	GPT-4.1	Clarity	98	3	4mo ago
Human Evaluation Set (test)	LongDPO	Win Rate	0.65	3	4mo ago
200 human-generated instructions	Olympus	Success Rate	0.865	3	4mo ago
HH dataset	RRHF_DP	Win Rate	59	3	4mo ago
MS MARCO (test)	RBG	Preference: FiD	18	3	4mo ago
Qwen-Image rollout results	HPSv3++	Win Rate	77.5	2	1mo ago

Showing 25 of 33 rows