PC-CrossDiff: Point-Cluster Dual-Level Cross-Modal Differential Attention for Unified 3D Referring and Segmentation
About
3D Visual Grounding (3DVG) aims to localize the referent of a natural language referring expression through two core tasks: Referring Expression Comprehension (3DREC) and Referring Expression Segmentation (3DRES). While existing methods achieve high accuracy in simple, single-object scenes, they suffer severe performance degradation in the complex, multi-object scenes common in real-world settings, which hinders practical deployment. In such scenes, existing methods face two key challenges: inadequate parsing of the implicit localization cues critical for disambiguating visually similar objects, and ineffective suppression of dynamic spatial interference from co-occurring objects, both of which degrade grounding accuracy.

To address these challenges, we propose PC-CrossDiff, a unified dual-task framework with a dual-level cross-modal differential attention architecture for 3DREC and 3DRES. Specifically, the framework introduces:

- **Point-Level Differential Attention (PLDA)** modules, which apply bidirectional differential attention between text and point clouds, adaptively extracting implicit localization cues via learnable weights to improve discriminative representation;
- **Cluster-Level Differential Attention (CLDA)** modules, which establish a hierarchical attention mechanism that adaptively enhances localization-relevant spatial relationships while suppressing ambiguous or irrelevant spatial relations through a localization-aware differential attention block.

Our method achieves state-of-the-art performance on the ScanRefer, Nr3D, and Sr3D benchmarks. Notably, on the Implicit subset of ScanRefer, it improves the Overall@0.50 score by +10.16% on the 3DREC task, highlighting its strong ability to parse implicit spatial cues.
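For intuition, the differential attention idea can be sketched as follows: two independent softmax attention maps are computed between one modality's tokens (queries) and the other's (keys/values), and the second map, scaled by a learnable weight, is subtracted from the first to cancel attention mass on distracting tokens. This is a minimal illustrative sketch, not the paper's implementation; the module name, projections, and the scalar `lam` parameterization are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DifferentialCrossAttention(nn.Module):
    """Illustrative cross-modal differential attention block (hypothetical).

    Two attention maps are formed from separate query/key projections;
    the second is subtracted with a learnable weight `lam`, so shared
    (distractor) attention mass cancels while discriminative mass remains.
    """

    def __init__(self, dim: int, lambda_init: float = 0.5):
        super().__init__()
        self.scale = dim ** -0.5
        # Two query/key projection pairs produce the two attention maps.
        self.q1, self.q2 = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.k1, self.k2 = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # Learnable differential weight (exact parameterization in the
        # paper may differ, e.g. per-head re-parameterized lambdas).
        self.lam = nn.Parameter(torch.tensor(lambda_init))

    def forward(self, x: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        # x:   (B, N, C) query tokens, e.g. point/cluster features
        # ctx: (B, M, C) key/value tokens, e.g. text token features
        a1 = F.softmax(self.q1(x) @ self.k1(ctx).transpose(-2, -1) * self.scale, dim=-1)
        a2 = F.softmax(self.q2(x) @ self.k2(ctx).transpose(-2, -1) * self.scale, dim=-1)
        attn = a1 - self.lam * a2  # differential attention map
        return attn @ self.v(ctx)  # (B, N, C)
```

A "bidirectional" variant, as used in PLDA, would apply such a block in both directions (points attending to text, and text attending to points).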
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 3DREC | ScanRefer | Overall Acc@0.25 IoU | 58.47 | 21 |
| 3DRES | ScanRefer | mIoU | 46.39 | 16 |
| 3DRES | Sr3D | Acc@0.25 IoU | 70.95 | 11 |
| 3DREC | Nr3D | Acc@0.25 IoU | 59.91 | 9 |
| 3DRES | ScanRefer Multiple subset (val) | Overall Acc@0.25 IoU | 55.33 | 7 |
| 3DRES | ScanRefer Implicit subset (val) | Overall Acc@0.25 IoU | 62.15 | 5 |
| 3DREC | Sr3D | Acc@0.25 IoU | 70.95 | 2 |
| 3DREC | Nr3D | Acc@0.25 IoU | 59.91 | 2 |
| 3DREC | ScanRefer Implicit subset (val) | Acc@0.25 IoU | 60.76 | 2 |
| 3DRES | Nr3D | Acc@0.25 IoU | 57.56 | 2 |