| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Human Preference Ranking | Human Evaluation Elo (test) | Elo Score: 1,634 | 34 |
| Visual Question Answering | Human Evaluation | Score: 1,090 | 21 |
| Image Captioning | Human Evaluation | Score: 1,148 | 21 |
| Multimodal Assessment | Human Evaluation | Score: 1,096 | 21 |
| Interpretation Script Generation | Human Evaluation 10 book excerpts | Simplification: 5 | 12 |
| Audiobook Audio Generation | Human Evaluation 10 book excerpts | Naturalness: 5 | 12 |
| Summarization | Human Evaluation 1-5 scale | Coherence: 4.4 | 10 |
| Text-to-Image Generation | Human Evaluation Total | Win Ratio: 85 | 10 |
| Debating | Human Evaluation Debate | EA: 86.6 | 10 |
| Personalized Image Generation | Human Evaluation 30 volunteers (test) | Win Rate: 7,082 | 8 |
| Solution Simulation | Human Evaluation Solution Simulation (test) | Score: 3.75 | 8 |
| Sentence Simplification | Human Evaluation 100-sentence sample (test) | Simplicity: 3.74 | 7 |
| Multi-shot Cinematic Video Generation | Human Evaluation | VQE: 57.6 | 6 |
| Machine Translation | Human Evaluation Average 2025 (test) | Avg Human Eval Score: 2.74 | 6 |
| Machine Translation | Human Evaluation EN⇒ZH 2025 (test) | Human Evaluation Score: 2.61 | 6 |
| Machine Translation | Human Evaluation ZH⇒EN 2025 (test) | Human Evaluation Score: 3.01 | 6 |
| Co-speech Gesture Generation | Human Evaluation User Study | Naturalness: 3.71 | 6 |
| Language Model Detoxification | Human Evaluation 50 generations (test) | Detoxification Count: 0.49 | 6 |
| Text-to-Video Generation | Human evaluation | Visual Quality: 87 | 6 |
| Subject and Motion Customization | Human Evaluation 50 groups: 5 motion patterns and 10 subjects | Text Alignment: 82.8 | 6 |
| Critique Quality Evaluation | Human Evaluation Overall | Win Rate: 66 | 6 |
| Text-guided Image Inpainting | Human Evaluation | Quality Score: 3.84 | 5 |
| Text-to-Music Generation | Human Evaluation | Overall Preference Score: 41.08 | 5 |
| Emotion Reasoning | Human evaluation 100-sample set | Factual Alignment: 3.7 | 5 |
| Text Anonymization | Human Evaluation | PPP: 7.5 | 5 |