VGent: Visual Grounding via Modular Design for Disentangling Reasoning and Prediction
About
Current visual grounding models either rely on a Multimodal Large Language Model (MLLM) that performs auto-regressive decoding, which is slow and prone to hallucination, or re-align an LLM with vision features to learn new special or object tokens for grounding, which can undermine the LLM's pretrained reasoning ability. In contrast, we propose VGent, a modular encoder-decoder architecture that explicitly disentangles high-level reasoning from low-level bounding-box prediction. Specifically, a frozen MLLM serves as the encoder, preserving its powerful reasoning capabilities untouched, while a decoder takes high-quality boxes proposed by detectors as queries and selects the target box(es) by cross-attending to the encoder's hidden states. This design fully leverages advances in both object detection and MLLMs, avoids the pitfalls of auto-regressive decoding, and enables fast inference. Moreover, it supports modular upgrades of both the encoder and the decoder that benefit the whole system: we introduce (i) QuadThinker, an RL-based training paradigm that enhances the encoder's multi-target reasoning ability; (ii) mask-aware labels that resolve detection-segmentation ambiguity; and (iii) global target recognition, which improves recognition of all targets and thereby the selection among augmented proposals. Experiments on multi-target visual grounding benchmarks show that VGent achieves a new state of the art with a +20.6% F1 improvement over prior methods, and further boosts gIoU by +8.2% and cIoU by +5.8% under visual reference challenges, while maintaining constant, fast inference latency.
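To make the disentanglement concrete, the core decoding step can be sketched as follows. This is a minimal, hypothetical PyTorch illustration (module and parameter names are our own, not from the VGent release): detector box proposals are embedded as queries, cross-attend to the frozen encoder's hidden states, and a lightweight head scores each proposal as target or non-target, so multi-target selection happens in a single forward pass rather than token by token.

```python
import torch
import torch.nn as nn

class BoxSelectionDecoder(nn.Module):
    """Hypothetical sketch of a VGent-style decoder: detector proposals act as
    queries that cross-attend to frozen-MLLM hidden states, and a binary head
    scores each proposal as target / non-target (no auto-regressive decoding)."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.box_embed = nn.Linear(4, d_model)   # (x, y, w, h) -> query vector
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.score_head = nn.Linear(d_model, 1)  # per-proposal target logit

    def forward(self, boxes: torch.Tensor, mllm_hidden: torch.Tensor) -> torch.Tensor:
        # boxes: (B, N, 4) detector proposals
        # mllm_hidden: (B, T, d_model) hidden states from the frozen encoder
        queries = self.box_embed(boxes)
        attended, _ = self.cross_attn(queries, mllm_hidden, mllm_hidden)
        return self.score_head(attended).squeeze(-1)  # (B, N) selection logits

# Toy usage: 10 proposals scored against 77 encoder hidden states.
decoder = BoxSelectionDecoder()
boxes = torch.rand(1, 10, 4)
hidden = torch.randn(1, 77, 256)
logits = decoder(boxes, hidden)
selected = logits.sigmoid() > 0.5  # independent per-box decisions -> multi-target
```

Because every proposal is scored in parallel, inference latency stays constant in the number of targets, unlike auto-regressive box generation.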
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy | 88.1 | 345 |
| Referring Expression Comprehension | RefCOCO (val) | Accuracy | 92.4 | 335 |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy | 94.7 | 333 |
| Referring Expression Comprehension | RefCOCOg (val) | Accuracy | 90.4 | 291 |
| Referring Expression Comprehension | RefCOCOg (test) | Accuracy | 90.1 | 291 |
| Referring Expression Comprehension | RefCOCO+ (test-A) | Accuracy | 92.2 | 172 |
| Referring Expression Comprehension | RefCOCO+ (test-B) | Accuracy | 83.3 | 167 |
| Referring Expression Comprehension | RefCOCO (test-B) | Accuracy | 89.8 | 160 |
| Reasoning Segmentation | ReasonSeg (test) | gIoU | 62.2 | 102 |
| Generalized Referring Expression Segmentation | GRES gRefCOCO (test) | gIoU | 77.14 | 8 |