DEGround: An Effective Baseline for Ego-centric 3D Visual Grounding with a Homogeneous Framework

About

Ego-centric 3D visual grounding is a core task in embodied intelligence. Existing methods typically adopt two-stage, heterogeneous pipelines that pair a detector with a separate grounding model. Incompatible decoders and box heads hinder the transfer of object-level priors, and the split training causes redundant re-optimization. To overcome these limitations, we present DEGround, a straightforward, elegant, and effective framework that centers on sharing object-level representations between detection and grounding. It employs a single set of queries as the common object representation for both tasks, decoded by a shared transformer and bounding box head. Building on this homogeneous framework, we further introduce two task-specific plug-in modules to enhance fine-grained instruction grounding. The Regional Activation Grounding module improves spatial-textual alignment by highlighting instruction-relevant regions, while the Query-wise Modulation module applies sentence-conditioned affine modulation to generate instruction-aware queries at initialization. Extensive experiments demonstrate that DEGround achieves the best performance on multiple benchmarks. Notably, it outperforms previous methods by 7.52% in overall precision on the EmbodiedScan dataset.
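
As a rough illustration of what "sentence-conditioned affine modulation" of queries can look like, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give the exact formulation, so the residual form `(1 + gamma) * q + beta`, the linear projections, and every name and dimension below (`modulate_queries`, `scale_proj`, `shift_proj`, `query_dim`, `text_dim`) are assumptions made for illustration only.

```python
import random

def linear(x, weight, bias):
    """Apply y = W x + b to one vector x; weight is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def init_linear(in_dim, out_dim, rng):
    """Hypothetical projection: small random weights, zero bias."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias

def modulate_queries(queries, sentence, scale_proj, shift_proj):
    """Assumed FiLM-style modulation: q' = (1 + gamma(s)) * q + beta(s).

    gamma and beta are predicted from the sentence embedding s and are
    shared by all queries, so each query is shifted and rescaled by the
    same instruction-dependent affine map.
    """
    gamma = linear(sentence, *scale_proj)
    beta = linear(sentence, *shift_proj)
    return [[(1.0 + g) * q + b for q, g, b in zip(query, gamma, beta)]
            for query in queries]

# Toy sizes chosen for illustration, not taken from the paper.
rng = random.Random(0)
num_queries, query_dim, text_dim = 4, 8, 16

queries = [[rng.gauss(0, 1) for _ in range(query_dim)]
           for _ in range(num_queries)]
sentence = [rng.gauss(0, 1) for _ in range(text_dim)]

scale_proj = init_linear(text_dim, query_dim, rng)
shift_proj = init_linear(text_dim, query_dim, rng)

modulated = modulate_queries(queries, sentence, scale_proj, shift_proj)
```

In this sketch the modulation is applied once, before any decoding, which matches the abstract's "at initialization" wording. With a zero sentence embedding and zero biases the map reduces to the identity, so the unmodulated detection queries are recovered.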

Yani Zhang, Dongming Wu, Hao Shi, Yingfei Liu, Tiancai Wang, Xingping Dong • 2025

Related benchmarks

Task                | Dataset              | Metric        | Result | Rank
3D Visual Grounding | EmbodiedScan (Full)  | Overall AP@25 | 62.18  | 8
3D Detection        | EmbodiedScan         | Overall AP@25 | 24.68  | 6
3D Visual Grounding | EmbodiedScan (test)  | Overall AP50  | 42.04  | 5
3D Visual Grounding | EmbodiedScan Mini    | Overall AP@25 | 61.28  | 4