
ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition

About

3D visual grounding aims to identify and localize objects in a 3D scene based on textual descriptions. However, existing methods struggle to disentangle targets from anchors in complex multi-anchor queries and to resolve inconsistencies in spatial descriptions caused by perspective variations. To tackle these challenges, we propose ViewSRD, a framework that formulates 3D visual grounding as a structured multi-view decomposition process. First, the Simple Relation Decoupling (SRD) module restructures complex multi-anchor queries into a set of targeted single-anchor statements, generating a structured set of perspective-aware descriptions that clarify positional relationships. These decomposed representations serve as the foundation for the Multi-view Textual-Scene Interaction (Multi-TSI) module, which integrates textual and scene features across multiple viewpoints using shared Cross-modal Consistent View Tokens (CCVTs) to preserve spatial correlations. Finally, a Textual-Scene Reasoning module synthesizes multi-view predictions into a unified, robust 3D visual grounding result. Experiments on 3D visual grounding datasets show that ViewSRD significantly outperforms state-of-the-art methods, particularly on complex queries requiring precise spatial differentiation. Code is available at https://github.com/visualjason/ViewSRD.
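To make the decomposition idea concrete, here is a toy sketch, not the paper's SRD implementation. It assumes the relation–anchor pairs have already been extracted from the raw query (the paper's module operates on full natural-language queries); the function name `decompose_query` and its signature are illustrative inventions:

```python
# Toy illustration of splitting a multi-anchor query into
# single-anchor statements, one per (relation, anchor) pair.
# This is a hypothetical sketch, not ViewSRD's actual SRD module.

def decompose_query(target: str, relations: list[tuple[str, str]]) -> list[str]:
    """Turn a target plus (relation, anchor) pairs into single-anchor statements."""
    return [
        f"the {target} that is {rel} the {anchor}"
        for rel, anchor in relations
    ]

# Example: a query with two anchors becomes two single-anchor statements.
statements = decompose_query(
    "chair", [("next to", "desk"), ("near", "window")]
)
print(statements)
# → ['the chair that is next to the desk', 'the chair that is near the window']
```

Each resulting statement mentions exactly one anchor, which is what lets the downstream multi-view modules reason about one positional relationship at a time.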

Ronggang Huang, Haoxin Yang, Yan Cai, Xuemiao Xu, Huaidong Zhang, Shengfeng He • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| 3D Visual Grounding | Nr3D (test) | Overall Success Rate: 69.9 | 88 |
| 3D Visual Grounding | Nr3D | Overall Success Rate: 69.9 | 74 |
| 3D Visual Grounding | Sr3D (test) | Overall Accuracy: 76 | 73 |
| 3D Visual Grounding | ScanRefer Unique | Acc@0.25 (IoU=0.25): 82.1 | 24 |
| 3D Visual Grounding | ScanRefer | Acc@0.25: 37.4 | 23 |
| 3D Visual Grounding | ScanRefer (test) | Unique Accuracy: 82.1 | 21 |
| 3D Visual Grounding | ScanRefer Overall | Acc@0.25: 45.4 | 17 |
