LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

About

Vision-language models (VLMs) commonly formulate visual grounding and detection as a coordinate-token generation problem, serializing each 2D box into multiple 1D tokens that are learned and decoded largely independently. This token-by-token decoding mismatches the coupled structure of box geometry and creates a practical inference bottleneck due to strictly sequential generation. We introduce LocateAnything, a unified generative grounding and detection framework based on Parallel Box Decoding (PBD). By decoding geometric elements such as bounding boxes and points as atomic units in a single step, LocateAnything preserves intra-box geometric coherence and unlocks substantial parallelism. We show that PBD improves both decoding throughput and localization accuracy. We further develop a scalable data engine and curate LocateAnything-Data, a large-scale dataset with more than 138 million training samples, substantially increasing data diversity for high-precision localization. Extensive evaluations show that LocateAnything advances the speed-accuracy frontier, achieving significantly higher decoding throughput while improving high-IoU localization quality across diverse benchmarks. The results highlight the complementary benefits of Parallel Box Decoding and large-scale training data in enabling efficient and precise unified visual grounding and detection.
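To make the contrast in the abstract concrete, the following minimal Python sketch serializes boxes into coordinate tokens and counts decoding steps under each scheme. It is an idealized illustration, not the authors' implementation: the bin count `NUM_BINS`, the helper names, and the assumption that only coordinate tokens are counted (no labels or separators) are all hypothetical.

```python
from typing import List, Tuple

# (x1, y1, x2, y2), each normalized to [0, 1]
Box = Tuple[float, float, float, float]

NUM_BINS = 1000  # assumed quantization resolution, not taken from the paper


def quantize(value: float, num_bins: int = NUM_BINS) -> int:
    """Map a normalized coordinate to an integer bin index in [0, num_bins - 1]."""
    clipped = min(max(value, 0.0), 1.0)
    return min(int(clipped * num_bins), num_bins - 1)


def serialize_boxes(boxes: List[Box]) -> List[int]:
    """Coordinate-token formulation: every 2D box becomes four 1D tokens."""
    return [quantize(c) for box in boxes for c in box]


def sequential_steps(boxes: List[Box]) -> int:
    """Token-by-token decoding: one step per coordinate token."""
    return len(serialize_boxes(boxes))


def parallel_box_steps(boxes: List[Box]) -> int:
    """Box-level decoding: one step per box, all four coordinates emitted together."""
    return len(boxes)


boxes = [(0.0, 0.25, 0.5, 1.0), (0.5, 0.0, 1.0, 0.75)]
print("coordinate tokens:", serialize_boxes(boxes))
print("sequential steps:", sequential_steps(boxes))
print("parallel box steps:", parallel_box_steps(boxes))
```

Under these assumptions the sequential scheme needs four steps per box while the box-level scheme needs one, which is the source of the parallelism the abstract describes. Real throughput gains depend on the model and the surrounding tokens, which this sketch does not model.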

Shihao Wang, Shilong Liu, Yuanguo Kuang, Xinyu Wei, Yangzhou Liu, Zhiqi Li, Yunze Man, Guo Chen, Andrew Tao, Guilin Liu, Jan Kautz, Lei Zhang, Zhiding Yu • 2026

Related benchmarks

Task	Dataset	Result
Object Detection	LVIS v1.0 (val)	--	548
GUI Grounding	ScreenSpot Pro	Average Score: 60.3	482
GUI Grounding	ScreenSpot v2	--	447
Referring Expression Comprehension	RefCOCOg (test)	--	300
Object Detection	LVIS	--	59
Dense Object Detection	Dense200	F1 Score @ IoU 0.5: 74	25
Object Pointing	VisDrone	F1@Point: 60.4	15
Object Pointing	RefCOCOg (val)	F1 Score: 0.913	13
Object Pointing	RefCOCOg (test)	F1 Score: 91	13
Dense Object Detection	VisDrone	F1@IoU 0.5: 63	11

Showing 10 of 21 rows
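Several rows above report an F1 score at an IoU threshold. As a rough illustration of what such a metric measures, the sketch below computes box IoU and an F1 score from one-to-one matches at a threshold. The greedy matching rule and the function names are assumptions made for illustration; the benchmarks' official evaluation protocols may differ.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(preds: List[Box], gts: List[Box], threshold: float = 0.5) -> float:
    """F1 with greedy one-to-one matching (an assumed, simplified protocol)."""
    unmatched = list(gts)
    tp = 0
    for p in preds:
        # Pick the still-unmatched ground-truth box that overlaps p the most.
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= threshold:
            tp += 1
            unmatched.remove(best)
    fp = len(preds) - tp  # predictions without a match
    fn = len(gts) - tp    # ground-truth boxes never matched
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


preds = [(0, 0, 2, 2), (5, 5, 6, 6)]
gts = [(0, 0, 2, 2), (10, 10, 12, 12)]
print(f1_at_iou(preds, gts, threshold=0.5))
```

Raising `threshold` makes the metric stricter about box tightness, which is the sense in which the abstract refers to high-IoU localization quality.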
