Human Evaluation

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Human Preference Ranking | Human Evaluation Elo (test) | Elo Score: 1,634 | 34 |
| Visual Question Answering | Human Evaluation | Score: 1,090 | 21 |
| Image Captioning | Human Evaluation | Score: 1,148 | 21 |
| Multimodal Assessment | Human Evaluation | Score: 1,096 | 21 |
| Text-to-Video Generation | Human Evaluation 50 participants, 400 ratings (test) | Mean Score: 4.84 | 16 |
| Video-to-Music Generation | Human Evaluation (Scene Cut Videos) | Music Quality Win Rate: 81.54 | 14 |
| Interpretation Script Generation | Human Evaluation 10 book excerpts | Simplification: 5 | 12 |
| Audiobook Audio Generation | Human Evaluation 10 book excerpts | Naturalness: 5 | 12 |
| Summarization | Human Evaluation 1-5 scale | Coherence: 4.4 | 10 |
| Text-to-Image Generation | Human Evaluation Total | Win Ratio: 85 | 10 |
| Debating | Human Evaluation Debate | EA: 86.6 | 10 |
| Creative Story Generation | Human Evaluation Creative Stories LLaMA3.1-8B-Instruct (test) | Creativity Score: 7.57 | 9 |
| Figurative-to-Literal Steering | Human Evaluation (sample of 100) | Successful Sentences Count: 75 | 8 |
| Literal-to-Figurative Steering | Human Evaluation (sample of 100) | Successful Sentences: 15 | 8 |
| Multimodal Content Generation | Human Evaluation N=20 (test) | Win Count: 19 | 8 |
| Image-to-video generation | Human Evaluation 40 LLM-generated prompts 1.0 (test) | Total ELO: 1,114.8 | 8 |
| Question Answering | Five-question human evaluation set | Relevance: 4.6 | 8 |
| Personalized Image Generation | Human Evaluation 30 volunteers (test) | Win Rate: 7,082 | 8 |
| Text-to-Video Generation | Human evaluation | Visual Quality: 87 | 8 |
| Solution Simulation | Human Evaluation Solution Simulation (test) | Score: 3.75 | 8 |
| Lip Synchronization | Human Evaluation (User Study) | Quality Score: 4.78 | 7 |
| Distractor Generation | Human Evaluation Set (test) | Relevance: 4.14 | 7 |
| Sentence Simplification | Human Evaluation 100-sentence sample (test) | Simplicity: 3.74 | 7 |
| Instruction Following with Long-term Memory | Human Evaluation 1-10 scale (test) | Coherence: 8.7 | 6 |
| Emotional Video Captioning | Human Evaluation | Accuracy: 7.62 | 6 |
Showing 25 of 95 rows