| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Visual Question Answering | V*Bench | Accuracy 98.95 | 84 |
| Visual Reasoning | V*Bench | Accuracy 95.7 | 58 |
| Visual Perception and Reasoning | V* Bench | Attribute Score 98.3 | 41 |
| Visually Grounded Reasoning | V* Bench | Average Accuracy 95.7 | 32 |
| Visual Perception Reasoning | V* Bench | Score 89.01 | 28 |
| Fine-grained Visual Question Answering | V*Bench | Overall Accuracy 92.15 | 28 |
| Vision-Centric Reasoning | V* Bench (Overall) | Attribute Score 96.5 | 24 |
| Visual Search | V* Bench | Accuracy 90.4 | 23 |
| Real-World Understanding | V* Bench | Accuracy 85.6 | 18 |
| Visual Understanding | V* Bench | Avg@8 EM 0.942 | 18 |
| Fine-grained visual understanding | V* Bench | General Score 85.5 | 18 |
| Visually Grounded Reasoning | V* Bench (test) | Overall Accuracy 95 | 17 |
| Multimodal Reasoning | V* Bench Tool-needed | Accuracy 90.1 | 15 |
| Visual Grounding | V* Bench | Overall Success Rate 95.7 | 14 |
| Visual Perception and Reasoning | V* Bench 1.0 (test) | Attribute Score 83.48 | 13 |
| High-Resolution Perception | V*-Bench v1.0 (test) | Overall Score 83.8 | 10 |
| Fine-grained Visual Perception | V* Bench | Overall Score 95.7 | 10 |
| Visual Perception | V*Bench | Accuracy 84.3 | 9 |
| Visual Tool-Use | V* Bench | Accuracy 88.2 | 9 |
| Multimodal Question Answering | V* Bench | Answer Accuracy 80.6 | 4 |
| Text-to-Video Generation | V-Bench | Generation Speed (x) 3.2 | 4 |
| Visual Search | V*Bench | Success Rate 75.3 | 2 |