DEGround: An Effective Baseline for Ego-centric 3D Visual Grounding with a Homogeneous Framework

About

Ego-centric 3D visual grounding is a core task in embodied intelligence. Existing methods typically adopt two-stage, heterogeneous pipelines that pair a detector with a separate grounding model. Their incompatible decoders and box heads hinder the transfer of object-level priors, and the split training causes redundant re-optimization. To overcome these limitations, we present DEGround, a straightforward, elegant, and effective framework that centers on object-level sharing between detection and grounding. It employs a single set of queries as the common object representation for both tasks, decoded by a shared transformer and a shared bounding box head. Building on this homogeneous framework, we further introduce two task-specific plug-in modules to enhance fine-grained instruction grounding. The Regional Activation Grounding module improves spatial-textual alignment by highlighting instruction-relevant regions, while the Query-wise Modulation module applies sentence-conditioned affine modulation to generate instruction-aware queries at initialization. Extensive experiments demonstrate that DEGround achieves the best performance on multiple benchmarks. Notably, it outperforms previous methods by 7.52% in overall precision on the EmbodiedScan dataset.
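As a rough illustration of what "sentence-conditioned affine modulation" of queries can look like, the sketch below scales and shifts every query with a gain and bias predicted from a sentence embedding (a FiLM-style formulation). This is not the authors' implementation: the function names, the single linear layers that predict the gain and bias, and the toy dimensions are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Compute y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def modulate_queries(
    queries: Matrix,
    sentence: Vector,
    w_gamma: Matrix,
    b_gamma: Vector,
    w_beta: Matrix,
    b_beta: Vector,
) -> Matrix:
    """Hypothetical helper: apply one sentence-derived affine map to every query.

    gamma (scale) and beta (shift) are predicted from the sentence embedding,
    then applied channel-wise: q' = gamma * q + beta.
    """
    gamma = linear(sentence, w_gamma, b_gamma)
    beta = linear(sentence, w_beta, b_beta)
    return [
        [g * q + b for q, g, b in zip(query, gamma, beta)]
        for query in queries
    ]


# Toy example: two 2-dimensional queries and a 3-dimensional sentence embedding.
queries = [[1.0, 2.0], [0.5, -1.0]]
sentence = [1.0, 0.0, 2.0]
w_gamma = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.5]]
b_gamma = [1.0, 0.0]
w_beta = [[0.0, 1.0, 0.0], [0.0, 0.0, 0.25]]
b_beta = [0.0, 0.0]

modulated = modulate_queries(queries, sentence, w_gamma, b_gamma, w_beta, b_beta)
print(modulated)
```

In this toy setup the same scale and shift are shared by all queries, so the instruction changes every query's initialization in the same direction; a real model would learn the projection weights and operate on batched tensors.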

Yani Zhang, Dongming Wu, Hao Shi, Yingfei Liu, Tiancai Wang, Xingping Dong • 2025

Related benchmarks

Task	Dataset	Metric	Result
3D Visual Grounding	EmbodiedScan (Full)	Overall AP@25	62.18	8
3D Detection	EmbodiedScan	Overall AP@25	24.68	6
3D Visual Grounding	EmbodiedScan (test)	Overall AP50	42.04	5
3D Visual Grounding	EmbodiedScan Mini	Overall AP@25	61.28	4
