SOTA Human Evaluation benchmarks and papers with code | Wizwand
Human Evaluation
Benchmarks
| Dataset Name | SOTA Method | Metric | Value | Results | Last Updated |
|---|---|---|---|---|---|
| ESConv | KEMI | Win Rate | 70 | 36 | 2mo ago |
| VOVILLE | PAVE | Score | 6.18 | 9 | 14d ago |
| NeurIPS ICML ICLR Proposals 2025 (test) | Stepwise CoT | Wins | 31 | 6 | 2mo ago |
| CulturalVQA OOD (test) | MMBoundary | Faithfulness | 7.66 | 6 | 3mo ago |
| ScienceVQA (test) | MMBoundary | Faithfulness Score | 8.35 | 6 | 3mo ago |
| A-OKVQA (test) | MMBoundary | Faithfulness Score | 7.83 | 6 | 3mo ago |
| DeepResearch Bench 20 reports (sampled) | PTAH | Readability (Win/Tie Rate) | 95 | 5 | 5d ago |
| UltraFeedback 50 sampled questions | OTPO | Win Rate (Expert 1) | 62 | 5 | 3mo ago |
| RCC-PVD (evaluation) | Ranking | Rank Preference Rate | 61 | 4 | 2mo ago |
| Human Evaluation | Ann Brown | Trustworthiness | 0.86 | 4 | 3mo ago |
| Human Evaluation Evil Players | GRAIL Agent | Contributed Success | 3.78 | 3 | 1mo ago |
| MathQA | Ours | Accuracy | 89.2 | 3 | 3mo ago |
| 50 randomly selected model responses | GPT-4.1 | Clarity | 98 | 3 | 3mo ago |
| Human Evaluation Set (test) | LongDPO | Win Rate | 0.65 | 3 | 3mo ago |
| 200 human-generated instructions | Olympus | Success Rate | 0.865 | 3 | 3mo ago |
| HH dataset | RRHF_DP | Win Rate | 59 | 3 | 3mo ago |
| MS MARCO (test) | RBG | Preference: FiD | 18 | 3 | 3mo ago |
| WildChat | BACO | Quality Score | 3.83 | 2 | 21h ago |
| NoveltyBench | BACO | Quality | 4.04 | 2 | 21h ago |
| Hindi spoken dialogue (test) | Human | Naturalness | 4.55 | 2 | 1mo ago |
| K/DA and K-OMG (50 random samples) | K-OMG | Overall Offensiveness Score | 3.24 | 2 | 3mo ago |
| LongBench Chat | LongReward + DPO | Helpfulness Win Rate | 14 | 1 | 3mo ago |
| WMT English-Czech 2019 | binmt | Preference: Much Better | 0.5 | 1 | 3mo ago |
| Tools 100 pairs | DiffLM | Win Rate | 88 | 1 | 3mo ago |
Showing 24 of 24 rows