Learning Point-Language Hierarchical Alignment for 3D Visual Grounding
About
This paper presents a novel hierarchical alignment model (HAM) that learns multi-granularity visual and linguistic representations in an end-to-end manner. We extract key points and proposal points to model 3D contexts and instances, and propose a point-language alignment with context modulation (PLACM) mechanism, which learns to gradually align word-level and sentence-level linguistic embeddings with visual representations, while modulation with the visual context captures latent informative relationships. To further capture both global and local relationships, we propose a spatially multi-granular modeling scheme that applies PLACM to both global and local fields. Experimental results demonstrate the superiority of HAM, with visualized results showing that it can dynamically model fine-grained visual and linguistic representations. HAM outperforms existing methods by a significant margin, achieves state-of-the-art performance on two publicly available datasets, and won the championship in the ECCV 2022 ScanRefer challenge. Code is available at https://github.com/PPjmchen/HAM.
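
The alignment idea can be illustrated with a minimal sketch: proposal-point features attend separately to word-level and sentence-level language embeddings, and the aligned features are gated by pooled key-point (scene context) features. Everything below, including the `PLACMSketch` class, its module names, and the tensor shapes, is an illustrative assumption and not the released HAM implementation.

```python
import torch
import torch.nn as nn

class PLACMSketch(nn.Module):
    """Hypothetical sketch of point-language alignment with context modulation.

    Shapes and module choices are assumptions for illustration; see the
    official repository (https://github.com/PPjmchen/HAM) for the real model.
    """

    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Word-level alignment: proposal points attend to individual word embeddings.
        self.word_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Sentence-level alignment: proposal points attend to a pooled sentence embedding.
        self.sent_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Context modulation: pooled key-point (scene context) features gate the aligned features.
        self.context_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, proposal_feats, key_point_feats, word_embeds):
        # proposal_feats:  (B, P, D) features of proposal (instance) points
        # key_point_feats: (B, K, D) features of key (context) points
        # word_embeds:     (B, W, D) word-level linguistic embeddings
        sent_embed = word_embeds.mean(dim=1, keepdim=True)            # (B, 1, D) sentence embedding

        word_aligned, _ = self.word_attn(proposal_feats, word_embeds, word_embeds)
        sent_aligned, _ = self.sent_attn(proposal_feats, sent_embed, sent_embed)

        # Fuse word-level and sentence-level aligned features.
        aligned = self.fuse(torch.cat([word_aligned, sent_aligned], dim=-1))  # (B, P, D)

        # Modulate with pooled visual context from the key points.
        context = key_point_feats.mean(dim=1, keepdim=True)            # (B, 1, D)
        gate = self.context_gate(context)                               # (B, 1, D)
        return proposal_feats + gate * aligned


# Example usage with made-up sizes: 8 scenes, 64 proposals, 1024 key points, 20 words.
if __name__ == "__main__":
    model = PLACMSketch(dim=256)
    out = model(torch.randn(8, 64, 256), torch.randn(8, 1024, 256), torch.randn(8, 20, 256))
    print(out.shape)  # torch.Size([8, 64, 256])
```

In the paper, this style of alignment is applied at multiple spatial granularities (global and local fields); the sketch shows only a single field for brevity.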
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 3D Visual Grounding | ScanRefer (val) | Overall Accuracy @ IoU 0.50 | 40.6 | 155 |
| 3D Visual Grounding | Nr3D (test) | Overall Success Rate | 48.2 | 88 |
| Visual Grounding | ScanRefer v1 (val) | Acc@0.5 (All) | 40.6 | 30 |
| 3D Visual Grounding | ScanRefer v1 (test) | Unique Acc@0.5 IoU | 63.7 | 15 |
| 3D Visual Grounding | ImputeRefer (test) | Unique IoU@0.25 | 67.1 | 7 |