SOTA Human Evaluation benchmarks and papers with code | Wizwand
Human Evaluation
Benchmarks
| Dataset Name | SOTA Method | Metric | Value | Results | Last Updated |
|---|---|---|---|---|---|
| ESConv | KEMI | Win Rate | 70 | 36 | 15d ago |
| NeurIPS ICML ICLR Proposals 2025 (test) | Stepwise CoT | Wins | 31 | 6 | 18d ago |
| CulturalVQA OOD (test) | MMBoundary | Faithfulness | 7.66 | 6 | 1mo ago |
| ScienceVQA (test) | MMBoundary | Faithfulness Score | 8.35 | 6 | 1mo ago |
| A-OKVQA (test) | MMBoundary | Faithfulness Score | 7.83 | 6 | 1mo ago |
| UltraFeedback 50 sampled questions | OTPO | Win Rate (Expert 1) | 62 | 5 | 1mo ago |
| RCC-PVD (evaluation) | Ranking | Rank Preference Rate | 61 | 4 | 22d ago |
| Human Evaluation | Ann Brown | Trustworthiness | 0.86 | 4 | 1mo ago |
| Human Evaluation Evil Players | GRAIL Agent | Contributed Success | 3.78 | 3 | 5d ago |
| MathQA | Ours | Accuracy | 89.2 | 3 | 1mo ago |
| 50 randomly selected model responses | GPT-4.1 | Clarity | 98 | 3 | 1mo ago |
| Human Evaluation Set (test) | LongDPO | Win Rate | 0.65 | 3 | 1mo ago |
| 200 human-generated instructions | Olympus | Success Rate | 0.865 | 3 | 1mo ago |
| HH dataset | RRHF_DP | Win Rate | 59 | 3 | 1mo ago |
| MS MARCO (test) | RBG | Preference: FiD | 18 | 3 | 1mo ago |
| K/DA and K-OMG (50 random samples) | K-OMG | Overall Offensiveness Score | 3.24 | 2 | 1mo ago |
| LongBench Chat | LongReward + DPO | Helpfulness Win Rate | 14 | 1 | 1mo ago |
| WMT English-Czech 2019 | binmt | Preference: Much Better | 0.5 | 1 | 1mo ago |
| Tools 100 pairs | DiffLM | Win Rate | 88 | 1 | 1mo ago |