| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Visual Reasoning | V*Bench | Accuracy | 95.7 | 58 |
| Fine-grained Visual Question Answering | V*Bench | Overall Accuracy | 92.15 | 28 |
| Visual Understanding | V* Bench | Avg@8 EM | 0.942 | 18 |
| Fine-grained visual understanding | V* Bench | General Score | 85.5 | 18 |
| Visual Question Answering | V*Bench | Accuracy | 98.95 | 17 |
| Visual Grounding | V* Bench | Overall Success Rate | 95.7 | 14 |
| Visual Search | V* Bench | Accuracy | 90.4 | 13 |
| Visual Perception and Reasoning | V* Bench 1.0 (test) | Attribute Score | 83.48 | 13 |
| Visual Perception | V*Bench | Accuracy | 84.3 | 9 |
| Visual Tool-Use | V* Bench | Accuracy | 88.2 | 9 |
| Text-to-Video Generation | V-Bench | Generation Speed (x) | 3.2 | 4 |
| Visual Search | V*Bench | Success Rate | 75.3 | 2 |