Z3D: Zero-Shot 3D Visual Grounding from Images

About

3D visual grounding (3DVG) aims to localize objects in a 3D scene based on natural language queries. In this work, we explore zero-shot 3DVG from multi-view images alone, without requiring any geometric supervision or object priors. We introduce Z3D, a universal grounding pipeline that flexibly operates on multi-view images while optionally incorporating camera poses and depth maps. We identify key bottlenecks in prior zero-shot methods that cause significant performance degradation and address them with (i) a state-of-the-art zero-shot 3D instance segmentation method to generate high-quality 3D bounding box proposals and (ii) advanced reasoning via prompt-based segmentation, which utilizes the full capabilities of modern VLMs. Extensive experiments on the ScanRefer and Nr3D benchmarks demonstrate that our approach achieves state-of-the-art performance among zero-shot methods. Code is available at https://github.com/col14m/z3d.

Nikita Drozdov, Andrey Lemeshko, Nikita Gavrilov, Anton Konushin, Danila Rukhovich, Maksim Kolodiazhnyi • 2026

Related benchmarks

Task	Dataset	Result
3D Visual Grounding	Nr3D (test)	Overall Success Rate: 58.8	88
Visual Grounding	ScanRefer v1 (val)	Acc@0.5 (Unique): 74.8	35
3D Visual Grounding	ScanRefer 250 scenes (test)	Acc@0.25 (Unique): 87.9	7
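
The Acc@0.25 and Acc@0.5 metrics above are commonly read as the fraction of queries whose predicted 3D box overlaps the ground-truth box with an intersection-over-union (IoU) of at least 0.25 or 0.5. The following is a minimal sketch of that computation, not the evaluation code from the Z3D repository. It assumes axis-aligned boxes stored as (xmin, ymin, zmin, xmax, ymax, zmax) with positive volume, and the names `iou_3d` and `acc_at_iou` are hypothetical.

```python
from typing import Sequence, Tuple

# Assumed box layout (not taken from the paper):
# (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned, positive volume.
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned box."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def iou_3d(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            # No overlap along this axis, so the boxes do not intersect.
            return 0.0
        inter *= hi - lo
    union = box_volume(a) + box_volume(b) - inter
    return inter / union


def acc_at_iou(preds: Sequence[Box], gts: Sequence[Box], threshold: float) -> float:
    """Fraction of (prediction, ground truth) pairs with IoU >= threshold."""
    if len(preds) != len(gts) or not gts:
        raise ValueError("preds and gts must be non-empty and of equal length")
    hits = sum(iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)
```

With one exact match and one prediction whose IoU is 1/3, `acc_at_iou` gives 1.0 at threshold 0.25 and 0.5 at threshold 0.5, which is why Acc@0.5 is the stricter of the two metrics.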

