| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Human Preference Ranking | Human Evaluation Elo (test) | Elo Score 1,634 | 34 |
| Visual Question Answering | Human Evaluation | Score 1,090 | 21 |
| Image Captioning | Human Evaluation | Score 1,148 | 21 |
| Multimodal Assessment | Human Evaluation | Score 1,096 | 21 |
| Text-to-Video Generation | Human Evaluation 50 participants, 400 ratings (test) | Mean Score 4.84 | 16 |
| Video-to-Music Generation | Human Evaluation (Scene Cut Videos) | Music Quality Win Rate 81.54 | 14 |
| Interpretation Script Generation | Human Evaluation 10 book excerpts | Simplification 5 | 12 |
| Audiobook Audio Generation | Human Evaluation 10 book excerpts | Naturalness 5 | 12 |
| Summarization | Human Evaluation 1-5 scale | Coherence 4.4 | 10 |
| Text-to-Image Generation | Human Evaluation Total | Win Ratio 85 | 10 |
| Debating | Human Evaluation Debate | EA 86.6 | 10 |
| Creative Story Generation | Human Evaluation Creative Stories LLaMA3.1-8B-Instruct (test) | Creativity Score 7.57 | 9 |
| Figurative-to-Literal Steering | Human Evaluation (sample of 100) | Successful Sentences Count 75 | 8 |
| Literal-to-Figurative Steering | Human Evaluation (sample of 100) | Successful Sentences 15 | 8 |
| Multimodal Content Generation | Human Evaluation N=20 (test) | Win Count 19 | 8 |
| Image-to-video generation | Human Evaluation 40 LLM-generated prompts 1.0 (test) | Total ELO 1,114.8 | 8 |
| Question Answering | Five-question human evaluation set | Relevance 4.6 | 8 |
| Personalized Image Generation | Human Evaluation 30 volunteers (test) | Win Rate 7,082 | 8 |
| Text-to-Video Generation | Human evaluation | Visual Quality 87 | 8 |
| Solution Simulation | Human Evaluation Solution Simulation (test) | Score 3.75 | 8 |
| Lip Synchronization | Human Evaluation (User Study) | Quality Score 4.78 | 7 |
| Distractor Generation | Human Evaluation Set (test) | Relevance 4.14 | 7 |
| Sentence Simplification | Human Evaluation 100-sentence sample (test) | Simplicity 3.74 | 7 |
| Instruction Following with Long-term Memory | Human Evaluation 1-10 scale (test) | Coherence 8.7 | 6 |
| Emotional Video Captioning | Human Evaluation | Accuracy 7.62 | 6 |