Beyond Bare Queries: Open-Vocabulary Object Grounding with 3D Scene Graph

About

Locating objects described in natural language presents a significant challenge for autonomous agents. Existing CLIP-based open-vocabulary methods successfully perform 3D object grounding with simple (bare) queries, but cannot cope with ambiguous descriptions that demand an understanding of object relations. To tackle this problem, we propose a modular approach called BBQ (Beyond Bare Queries), which constructs a 3D scene graph representation with metric and semantic spatial edges and uses a large language model as a human-to-agent interface through our deductive scene reasoning algorithm. BBQ employs robust DINO-powered associations to construct a 3D object-centric map and an advanced raycasting algorithm with a 2D vision-language model to describe the mapped objects as graph nodes. On the Replica and ScanNet datasets, we demonstrate that BBQ achieves leading results in open-vocabulary 3D semantic segmentation compared to other zero-shot methods. We also show that leveraging spatial relations is especially effective for scenes containing multiple entities of the same semantic class. On the challenging Sr3D+, Nr3D, and ScanRefer benchmarks, our deductive approach demonstrates a significant improvement over other state-of-the-art methods, enabling object grounding from complex queries. The combination of our design choices and software implementation results in fast data processing in experiments on a robot's on-board computer. This promising performance enables the application of our approach in intelligent robotics projects. The code is publicly available at https://linukc.github.io/BeyondBareQueries/.

Sergey Linok, Tatiana Zemskova, Svetlana Ladanova, Roman Titkov, Dmitry Yudin, Maxim Monastyrny, Aleksei Valenkov • 2024
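
To make the idea of a scene graph with metric spatial edges more concrete, here is a minimal sketch in Python. It is not the BBQ implementation: the names `ObjectNode`, `build_metric_edges`, and `ground_nearest` are hypothetical, object centers are assumed to be 3D centroids in metres, and a plain keyword match stands in for the large language model that the abstract describes as the reasoning step.

```python
import math
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List, Optional, Tuple


@dataclass
class ObjectNode:
    """Hypothetical graph node: one mapped object with a text description."""
    node_id: int
    caption: str                        # e.g. produced by a 2D vision-language model
    center: Tuple[float, float, float]  # assumed 3D centroid, in metres


def build_metric_edges(nodes: List[ObjectNode]) -> Dict[Tuple[int, int], float]:
    """Return {(id_a, id_b): Euclidean distance} for every unordered node pair."""
    return {
        (a.node_id, b.node_id): math.dist(a.center, b.center)
        for a, b in combinations(nodes, 2)
    }


def edge_distance(edges: Dict[Tuple[int, int], float], id_a: int, id_b: int) -> float:
    """Look up an undirected edge regardless of the order of its endpoints."""
    return edges[(id_a, id_b)] if (id_a, id_b) in edges else edges[(id_b, id_a)]


def ground_nearest(nodes, edges, target_word: str, anchor_word: str) -> Optional[ObjectNode]:
    """Pick the target candidate closest to any anchor candidate.

    Keyword matching here is an assumption made for illustration only;
    it replaces the language-model reasoning mentioned in the abstract.
    """
    targets = [n for n in nodes if target_word in n.caption]
    anchors = [n for n in nodes if anchor_word in n.caption]
    if not targets or not anchors:
        return None

    def nearest_anchor_distance(t: ObjectNode) -> float:
        # Skip the node itself in case it matches both words.
        return min(
            (edge_distance(edges, t.node_id, a.node_id)
             for a in anchors if a.node_id != t.node_id),
            default=math.inf,
        )

    return min(targets, key=nearest_anchor_distance)


# Two chairs of the same class; the query "the chair near the table" is ambiguous
# without spatial relations.
scene = [
    ObjectNode(0, "wooden chair", (0.0, 0.0, 0.0)),
    ObjectNode(1, "wooden chair", (3.0, 0.0, 0.0)),
    ObjectNode(2, "dining table", (2.5, 0.0, 0.0)),
]
scene_edges = build_metric_edges(scene)
chosen = ground_nearest(scene, scene_edges, "chair", "table")
```

In this toy scene the two chairs share a caption, so only the metric edge to the table separates them, which is the situation the abstract highlights for scenes with multiple entities of the same semantic class.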

Related benchmarks

Task	Dataset	Metric	Result
3D Visual Grounding	ScanRefer (val)	Overall Accuracy @ IoU 0.50	11.6	253
3D Open-set Semantic Segmentation	ScanNet 8 scenes	mAcc	56	7
3D Open-set Semantic Segmentation	Replica 8 standard scenes	mAcc	38	6
Text-based Object Retrieval	Sr3D	Acc@0.1	23	5
3D Object Grounding	Nr3D	Overall Accuracy (IoU=0.10)	28.3	5
Scene Graph Generation	Replica 19 (Room0)	Generation Time (sec)	56.5553	3
Scene Graph Generation	Replica 19 (Room1)	Generation Time (sec)	45.9022	3
Scene Graph Generation	Replica 19 (Room2)	Latency (s)	53.001	3
Scene Graph Generation	Replica office0 19	Latency (sec)	35.6428	3
Scene Graph Generation	Replica 19 (office1)	Generation Time (sec)	29.7156	3

Showing 10 of 14 rows
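
The grounding rows above report accuracy at an IoU threshold (for example 0.10 or 0.50). The sketch below shows one common way such a metric can be computed, assuming axis-aligned 3D boxes given as `(xmin, ymin, zmin, xmax, ymax, zmax)` and one prediction per query. The function names are hypothetical, and the exact evaluation protocol of each benchmark may differ.

```python
from typing import Sequence

Box = Sequence[float]  # assumed layout: (xmin, ymin, zmin, xmax, ymax, zmax)


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned 3D box."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def box_iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for axis in range(3):
        overlap = min(a[axis + 3], b[axis + 3]) - max(a[axis], b[axis])
        if overlap <= 0:
            return 0.0  # boxes are disjoint along this axis
        intersection *= overlap
    union = box_volume(a) + box_volume(b) - intersection
    return intersection / union


def accuracy_at_iou(predictions: Sequence[Box],
                    ground_truths: Sequence[Box],
                    threshold: float) -> float:
    """Fraction of queries whose predicted box reaches the IoU threshold."""
    if not predictions:
        return 0.0
    hits = sum(
        box_iou_3d(p, g) >= threshold
        for p, g in zip(predictions, ground_truths)
    )
    return hits / len(predictions)


# Toy data: the first prediction overlaps its ground truth by one third,
# the second matches exactly.
predicted = [(1.0, 0.0, 0.0, 3.0, 2.0, 2.0), (0.0, 0.0, 0.0, 1.0, 1.0, 1.0)]
reference = [(0.0, 0.0, 0.0, 2.0, 2.0, 2.0), (0.0, 0.0, 0.0, 1.0, 1.0, 1.0)]
acc_loose = accuracy_at_iou(predicted, reference, 0.10)
acc_strict = accuracy_at_iou(predicted, reference, 0.50)
```

With these toy boxes the first prediction has an IoU of 1/3, so it counts as a hit at threshold 0.10 but not at 0.50, which illustrates why the same method can score very differently across the thresholds listed in the table.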
