Human Evaluation

Benchmarks

Task Name	Dataset Name	Metric	SOTA Result
Human Preference Ranking	Human Evaluation Elo (test)	Elo Score	1,634	34
Visual Question Answering	Human Evaluation	Score	1,090	21
Image Captioning	Human Evaluation	Score	1,148	21
Multimodal Assessment	Human Evaluation	Score	1,096	21
Text-to-Video Generation	Human Evaluation 50 participants, 400 ratings (test)	Mean Score	4.84	16
Video-to-Music Generation	Human Evaluation (Scene Cut Videos)	Music Quality Win Rate	81.54	14
Interpretation Script Generation	Human Evaluation 10 book excerpts	Simplification	5	12
Audiobook Audio Generation	Human Evaluation 10 book excerpts	Naturalness	5	12
Summarization	Human Evaluation 1-5 scale	Coherence	4.4	10
Text-to-Image Generation	Human Evaluation Total	Win Ratio	85	10
Debating	Human Evaluation Debate	EA	86.6	10
Creative Story Generation	Human Evaluation Creative Stories LLaMA3.1-8B-Instruct (test)	Creativity Score	7.57	9
Figurative-to-Literal Steering	Human Evaluation (sample of 100)	Successful Sentences Count	75	8
Literal-to-Figurative Steering	Human Evaluation (sample of 100)	Successful Sentences	15	8
Multimodal Content Generation	Human Evaluation N=20 (test)	Win Count	19	8
Image-to-video generation	Human Evaluation 40 LLM-generated prompts 1.0 (test)	Total ELO	1,114.8	8
Question Answering	Five-question human evaluation set	Relevance	4.6	8
Personalized Image Generation	Human Evaluation 30 volunteers (test)	Win Rate	7,082	8
Text-to-Video Generation	Human evaluation	Visual Quality	87	8
Solution Simulation	Human Evaluation Solution Simulation (test)	Score	3.75	8
Lip Synchronization	Human Evaluation (User Study)	Quality Score	4.78	7
Distractor Generation	Human Evaluation Set (test)	Relevance	4.14	7
Sentence Simplification	Human Evaluation 100-sentence sample (test)	Simplicity	3.74	7
Video Generation and Editing	Human Evaluation	Consistency Score	3.5	6
Instruction Following with Long-term Memory	Human Evaluation 1-10 scale (test)	Coherence	8.7	6

Showing 25 of 117 rows