Human Evaluation

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Human Preference Ranking | Human Evaluation Elo (test) | Elo Score: 1,634 | 34 |
| Visual Question Answering | Human Evaluation | Score: 1,090 | 21 |
| Image Captioning | Human Evaluation | Score: 1,148 | 21 |
| Multimodal Assessment | Human Evaluation | Score: 1,096 | 21 |
| Interpretation Script Generation | Human Evaluation 10 book excerpts | Simplification: 5 | 12 |
| Audiobook Audio Generation | Human Evaluation 10 book excerpts | Naturalness: 5 | 12 |
| Summarization | Human Evaluation 1-5 scale | Coherence: 4.4 | 10 |
| Text-to-Image Generation | Human Evaluation Total | Win Ratio: 85 | 10 |
| Debating | Human Evaluation Debate | EA: 86.6 | 10 |
| Personalized Image Generation | Human Evaluation 30 volunteers (test) | Win Rate: 7,082 | 8 |
| Solution Simulation | Human Evaluation Solution Simulation (test) | Score: 3.75 | 8 |
| Sentence Simplification | Human Evaluation 100-sentence sample (test) | Simplicity: 3.74 | 7 |
| Multi-shot Cinematic Video Generation | Human Evaluation | VQE: 57.6 | 6 |
| Machine Translation | Human Evaluation Average 2025 (test) | Avg Human Eval Score: 2.74 | 6 |
| Machine Translation | Human Evaluation EN⇒ZH 2025 (test) | Human Evaluation Score: 2.61 | 6 |
| Machine Translation | Human Evaluation ZH⇒EN 2025 (test) | Human Evaluation Score: 3.01 | 6 |
| Co-speech Gesture Generation | Human Evaluation User Study | Naturalness: 3.71 | 6 |
| Language Model Detoxification | Human Evaluation 50 generations (test) | Detoxification Count: 0.49 | 6 |
| Text-to-Video Generation | Human evaluation | Visual Quality: 87 | 6 |
| Subject and Motion Customization | Human Evaluation 50 groups: 5 motion patterns and 10 subjects | Text Alignment: 82.8 | 6 |
| Critique Quality Evaluation | Human Evaluation Overall | Win Rate: 66 | 6 |
| Text-guided Image Inpainting | Human Evaluation | Quality Score: 3.84 | 5 |
| Text-to-Music Generation | Human Evaluation | Overall Preference Score: 41.08 | 5 |
| Emotion Reasoning | Human evaluation 100-sample set | Factual Alignment: 3.7 | 5 |
| Text Anonymization | Human Evaluation | PPP: 7.5 | 5 |
Showing 25 of 65 rows