| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Human Preference Ranking | Human Evaluation Elo (test) | Elo Score: 1,634 | 34 |
| Visual Question Answering | Human Evaluation | Score: 1,090 | 21 |
| Image Captioning | Human Evaluation | Score: 1,148 | 21 |
| Multimodal Assessment | Human Evaluation | Score: 1,096 | 21 |
| Text-to-Video Generation | Human Evaluation 50 participants, 400 ratings (test) | Mean Score: 4.84 | 16 |
| Video-to-Music Generation | Human Evaluation (Scene Cut Videos) | Music Quality Win Rate: 81.54 | 14 |
| Interpretation Script Generation | Human Evaluation 10 book excerpts | Simplification: 5 | 12 |
| Audiobook Audio Generation | Human Evaluation 10 book excerpts | Naturalness: 5 | 12 |
| Summarization | Human Evaluation 1-5 scale | Coherence: 4.4 | 10 |
| Text-to-Image Generation | Human Evaluation Total | Win Ratio: 85 | 10 |
| Debating | Human Evaluation Debate | EA: 86.6 | 10 |
| Question Answering | Five-question human evaluation set | Relevance: 4.6 | 8 |
| Personalized Image Generation | Human Evaluation 30 volunteers (test) | Win Rate: 7,082 | 8 |
| Solution Simulation | Human Evaluation Solution Simulation (test) | Score: 3.75 | 8 |
| Sentence Simplification | Human Evaluation 100-sentence sample (test) | Simplicity: 3.74 | 7 |
| Instruction Following with Long-term Memory | Human Evaluation 1-10 scale (test) | Coherence: 8.7 | 6 |
| Emotional Video Captioning | Human Evaluation | Accuracy: 7.62 | 6 |
| Action Prediction | Human Evaluation User Actions Dataset (test) | Win Rate: 79 | 6 |
| Painting Quality Evaluation | Human Evaluation 51 participants (test) | Style Score: 3.38 | 6 |
| Multi-shot Cinematic Video Generation | Human Evaluation | VQE: 57.6 | 6 |
| Machine Translation | Human Evaluation Average 2025 (test) | Avg Human Eval Score: 2.74 | 6 |
| Machine Translation | Human Evaluation EN⇒ZH 2025 (test) | Human Evaluation Score: 2.61 | 6 |
| Machine Translation | Human Evaluation ZH⇒EN 2025 (test) | Human Evaluation Score: 3.01 | 6 |
| Co-speech Gesture Generation | Human Evaluation User Study | Naturalness: 3.71 | 6 |
| Language Model Detoxification | Human Evaluation 50 generations (test) | Detoxification Count: 0.49 | 6 |