
Human Evaluation

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Human Preference Ranking | Human Evaluation Elo (test) | Elo Score: 1,634 | 34 |
| Visual Question Answering | Human Evaluation | Score: 1,090 | 21 |
| Image Captioning | Human Evaluation | Score: 1,148 | 21 |
| Multimodal Assessment | Human Evaluation | Score: 1,096 | 21 |
| Text-to-Video Generation | Human Evaluation 50 participants, 400 ratings (test) | Mean Score: 4.84 | 16 |
| Video-to-Music Generation | Human Evaluation (Scene Cut Videos) | Music Quality Win Rate: 81.54 | 14 |
| Interpretation Script Generation | Human Evaluation 10 book excerpts | Simplification: 5 | 12 |
| Audiobook Audio Generation | Human Evaluation 10 book excerpts | Naturalness: 5 | 12 |
| Summarization | Human Evaluation 1-5 scale | Coherence: 4.4 | 10 |
| Text-to-Image Generation | Human Evaluation Total | Win Ratio: 85 | 10 |
| Debating | Human Evaluation Debate | EA: 86.6 | 10 |
| Question Answering | Five-question human evaluation set | Relevance: 4.6 | 8 |
| Personalized Image Generation | Human Evaluation 30 volunteers (test) | Win Rate: 7,082 | 8 |
| Solution Simulation | Human Evaluation Solution Simulation (test) | Score: 3.75 | 8 |
| Sentence Simplification | Human Evaluation 100-sentence sample (test) | Simplicity: 3.74 | 7 |
| Instruction Following with Long-term Memory | Human Evaluation 1-10 scale (test) | Coherence: 8.7 | 6 |
| Emotional Video Captioning | Human Evaluation | Accuracy: 7.62 | 6 |
| Action Prediction | Human Evaluation User Actions Dataset (test) | Win Rate: 79 | 6 |
| Painting Quality Evaluation | Human Evaluation 51 participants (test) | Style Score: 3.38 | 6 |
| Multi-shot Cinematic Video Generation | Human Evaluation | VQE: 57.6 | 6 |
| Machine Translation | Human Evaluation Average 2025 (test) | Avg Human Eval Score: 2.74 | 6 |
| Machine Translation | Human Evaluation EN⇒ZH 2025 (test) | Human Evaluation Score: 2.61 | 6 |
| Machine Translation | Human Evaluation ZH⇒EN 2025 (test) | Human Evaluation Score: 3.01 | 6 |
| Co-speech Gesture Generation | Human Evaluation User Study | Naturalness: 3.71 | 6 |
| Language Model Detoxification | Human Evaluation 50 generations (test) | Detoxification Count: 0.49 | 6 |
Showing 25 of 85 rows