| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Visual Question Answering | V* | Accuracy | 74.35 | 45 |
| High-resolution perception | V* | Overall Score | 89.53 | 26 |
| Visual Reasoning | V* | Overall Score | 95.7 | 22 |
| Visual Perception and Reasoning | V* | Overall Accuracy | 90.1 | 18 |
| Visual Reasoning | V* | Accuracy | 90.2 | 18 |
| Vision-Intensive Perception | V* Benchmark | Attr Score | 84.4 | 18 |
| Semantic Segmentation | V20 | mIoU | 83.8 | 15 |
| Visual Reasoning | V* cross-domain (test) | Accuracy | 79.06 | 15 |
| High-resolution Visual Search | V* | Top-1 Accuracy | 86.91 | 13 |
| Fine-grained visual reasoning | V* | Avg@8 Overall | 89.5 | 13 |
| Visual Grounding | V* Relative Position 52 | Accuracy | 89.47 | 13 |
| Visual Grounding | V* Direct Attributes 52 | Accuracy | 90.43 | 13 |
| High-resolution Multi-modal Understanding | V* | Accuracy | 80.23 | 13 |
| Fine-grained Perception | V* | Accuracy | 78.8 | 13 |
| Visual Perception | V* | Score | 89 | 12 |
| Visual Search | V* | Average Success | 90.6 | 11 |
| Visual Reasoning | V* (test) | Overall Score | 92.2 | 11 |
| Perception | V* (test) | Accuracy | 86.9 | 11 |
| Visual Search | V* bench (test) | Attribute Rate | 87 | 10 |
| Causal Discovery | V | Structural F1 | 80 | 9 |
| Fine-grained Visual Reasoning | V* | Accuracy | 89 | 8 |
| Multimodal Multi-choice | V* | Accuracy | 84.3 | 8 |
| Visual Search and Comprehension | V* | Accuracy | 89.8 | 8 |
| Delay Identification | V | Precision of Delay (POD) | 100 | 7 |
| Multimodal reasoning | V* | Pass@1 | 89.5 | 7 |