Towards Cross-View Point Correspondence in Vision-Language Models
About
Cross-view correspondence is a fundamental capability for spatial understanding and embodied AI. However, it is still far from being realized in Vision-Language Models (VLMs), especially at the level of precise point correspondence, which is crucial for precise affordance interaction. We therefore propose the Cross-View Point Correspondence (CVPC) task and CrossPoint-Bench, a comprehensive benchmark with a hierarchical design inspired by the human cognitive process of "perceive", "reason", and "correspond". Our evaluation shows that state-of-the-art models (e.g., Gemini-2.5-Pro) still fall far behind humans, with a gap of over 54.65% in overall accuracy, exposing the challenge of moving from coarse-grained judgement to fine-grained coordinate prediction. To address this problem, we construct CrossPoint-378K, a dataset of 378K question-answering pairs across 900 scenes, focused on actionable affordance regions that better reflect real-world manipulation and interaction scenarios. Furthermore, we propose CroPond, a model trained on the CrossPoint-378K dataset. CroPond achieves state-of-the-art performance on CrossPoint-Bench, surpassing Gemini-2.5-Pro by 39.7% in accuracy, and offers a foundation for future work on cross-view correspondence. The benchmark, dataset, and model are publicly available at https://github.com/WangYipu2002/CrossPoint.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Fine-grained Grounding | CrossPoint-Bench | Object Accuracy | 93.94 | 38 |
| Correspondence-Pointing | CrossPoint-Bench | Object Accuracy | 84.3 | 19 |
| Spatial Reasoning | CrossPoint-Bench | Score | 76.8 | 19 |
| Visibility Reasoning | CrossPoint-Bench | Object Accuracy | 81.73 | 19 |
| Spatial Reasoning | SPAR-Bench full | Average Score | 53.64 | 12 |
| Spatial Understanding | CV-Bench v1 (test) | Relational Score | 94 | 11 |
| Spatial Reasoning | SAT | Overall Acc | 78.33 | 11 |
| Spatial Reasoning | SPAR-Bench tiny | -- | -- | 7 |