| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Spatial Reasoning | VSI-Bench | Avg Score | 79.2 | 192 |
| Spatial Reasoning | VSI-Bench 1.0 (test) | Relative Distance Error | 25 | 80 |
| Visual Spatial Intelligence | VSI-Bench | Average Score | 79.2 | 48 |
| 3D Question Answering | VSI-Bench | Average Score | 63.2 | 37 |
| Video spatial reasoning | VSI-Bench | Average Score | 79.2 | 32 |
| Spatial Reasoning | VSI-Bench (test) | Avg Score | 62.8 | 21 |
| Visual Spatial Understanding | VSI-Bench static 1.0 | Average Score | 89.5 | 20 |
| Spatial Reasoning | VSI-Bench Vanilla regime | Avg Score | 50.5 | 19 |
| Multi-view global spatial scene understanding | VSI-Bench | Relative Distance Score | 94.7 | 16 |
| Spatial Reasoning | VSI-Bench tiny | Route Plan | 46.94 | 15 |
| Spatial Reasoning | VSI-Bench 59 (test) | Object Count Score | 72.5 | 14 |
| Spatial Reasoning (Video) | VSI-Bench | Accuracy | 68.3 | 14 |
| Fine-grained video-based spatial reasoning | VSI-Bench | Avg Score | 60.6 | 13 |
| Video Spatial Intelligence | VSI-Bench 123 (test) | Object Count | 70 | 13 |
| Video Understanding | VSI-Bench 67 (test) | Average Score | 52.7 | 12 |
| Spatial Reasoning | VSI-Bench MV 76 | Accuracy | 63.7 | 11 |
| Video Understanding | VSI-Bench | Accuracy | 49.5 | 11 |
| multi-view Visual Question Answering | VSI-Bench (test) | Average Score | 52.9 | 11 |
| Video Scene Identification | VSI-Bench | Accuracy | 38.2 | 10 |
| 3D Spatial Visual Question Answering | VSI-Bench ARKit | Average Score | 51.3 | 8 |
| Video Scene Interaction | VSI-Bench | Accuracy | 35.8 | 6 |
| Visual Question Answering | VSI-Bench | Accuracy | 70.6 | 2 |