| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| General Reasoning & Understanding | VStar | Accuracy | 82.1 | 21 |
| Fine-grained Visual Perception | VStar | VStar Score | 82.72 | 18 |
| Visual Search and Reasoning | VStar | Score | 76.96 | 18 |
| Multimodal Perception | VStar | Accuracy | 92.67 | 18 |
| Visual Perception | VStar (test) | Accuracy | 92.7 | 15 |
| Visual Understanding | VStar | Accuracy | 85.86 | 11 |
| Perceptual Robustness | VStar | Overall Accuracy | 80.25 | 9 |
| Video-grounded Dialogue Generation | VSTAR (test) | BLEU-1 | 0.092 | 9 |
| Dialogue Topic Segmentation | VSTAR | WinDif | 0.765 | 7 |
| Dialogue Scene Segmentation | VSTAR (test) | mIoU | 53.6 | 7 |