VGent: Visual Grounding via Modular Design for Disentangling Reasoning and Prediction

About

Current visual grounding models are either based on a Multimodal Large Language Model (MLLM) that performs auto-regressive decoding, which is slow and risks hallucinations, or on re-aligning an LLM with vision features to learn new special or object tokens for grounding, which may undermine the LLM's pretrained reasoning ability. In contrast, we propose VGent, a modular encoder-decoder architecture that explicitly disentangles high-level reasoning and low-level bounding box prediction. Specifically, a frozen MLLM serves as the encoder to provide powerful, untouched reasoning capabilities, while a decoder takes high-quality boxes proposed by detectors as queries and selects the target box(es) by cross-attending to the encoder's hidden states. This design fully leverages advances in both object detection and MLLMs, avoids the pitfalls of auto-regressive decoding, and enables fast inference. Moreover, it supports modular upgrades of both the encoder and decoder to benefit the whole system: we introduce (i) QuadThinker, an RL-based training paradigm for enhancing the multi-target reasoning ability of the encoder; (ii) a mask-aware label for resolving detection-segmentation ambiguity; and (iii) global target recognition to improve recognition of all targets, which benefits the selection among augmented proposals. Experiments on multi-target visual grounding benchmarks show that VGent achieves a new state-of-the-art with +20.6% F1 improvement over prior methods, and further boosts gIoU by +8.2% and cIoU by +5.8% under visual reference challenges, while maintaining constant, fast inference latency.

Weitai Kang, Jason Kuen, Mengwei Ren, Zijun Wei, Yan Yan, Kangning Liu • 2025
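
The selection step described in the abstract (detector boxes act as queries, cross-attend to the frozen encoder's hidden states, and are kept or dropped) can be illustrated with a toy sketch. This is not the authors' implementation: the single-head attention without learned projections, the `score_vector` scoring head, the 0.5 threshold, and all function names are assumptions made for illustration. It only shows why selection needs no auto-regressive decoding, since each proposal is scored independently in one pass.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attend(query, hidden_states):
    # Single-head scaled dot-product attention of one box query
    # over the encoder hidden states (hypothetical simplification:
    # no learned key/value/query projections).
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, h) / scale for h in hidden_states])
    dim = len(hidden_states[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden_states))
            for d in range(dim)]

def select_boxes(box_queries, hidden_states, score_vector, threshold=0.5):
    # Score every proposal independently; keep those above the threshold.
    # score_vector stands in for a learned scoring head (assumption).
    selected = []
    for i, query in enumerate(box_queries):
        context = cross_attend(query, hidden_states)
        prob = 1.0 / (1.0 + math.exp(-dot(context, score_vector)))
        if prob > threshold:
            selected.append(i)
    return selected

# Toy data: two encoder states and two box proposals in a 2-d feature space.
hidden_states = [[4.0, 0.0], [0.0, 4.0]]
box_queries = [[4.0, 0.0], [0.0, 4.0]]
score_vector = [1.0, -1.0]
print(select_boxes(box_queries, hidden_states, score_vector))  # [0]
```

Because the loop body does not depend on earlier iterations, the per-proposal scoring could be batched, which is consistent with the constant-latency claim in the abstract.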

Related benchmarks

Task | Dataset | Result | Rank
Referring Expression Comprehension | RefCOCO+ (val) | Accuracy 88.1 | 345
Referring Expression Comprehension | RefCOCO (val) | Accuracy 92.4 | 335
Referring Expression Comprehension | RefCOCO (testA) | Accuracy 0.947 | 333
Referring Expression Comprehension | RefCOCOg (val) | Accuracy 90.4 | 291
Referring Expression Comprehension | RefCOCOg (test) | Accuracy 90.1 | 291
Referring Expression Comprehension | RefCOCO+ (test-A) | Accuracy 92.2 | 172
Referring Expression Comprehension | RefCOCO+ (test-B) | Accuracy 83.3 | 167
Referring Expression Comprehension | RefCOCO (test-B) | Accuracy 89.8 | 160
Reasoning Segmentation | ReasonSeg (test) | gIoU 62.2 | 102
Generalized Referring Expression Segmentation | GRES gRefCOCO (test) | gIoU 77.14 | 8
Showing 10 of 13 rows
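
For readers unfamiliar with the segmentation metrics above, the following sketch shows how gIoU and cIoU are commonly computed in the referring and reasoning segmentation literature: gIoU averages the per-image IoU, while cIoU divides the cumulative intersection by the cumulative union over the whole dataset. Here gIoU is not the box-regression "Generalized IoU". The exact evaluation protocol behind the numbers in this table is not stated on this page, so the handling of empty masks and the function names below are assumptions.

```python
def mask_iou_terms(pred, gt):
    # pred and gt are flattened binary masks (sequences of 0/1).
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union

def giou_ciou(preds, gts):
    per_image = []
    total_inter = 0
    total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = mask_iou_terms(pred, gt)
        # Assumption: an empty prediction for an empty target scores 1.0.
        per_image.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    giou = sum(per_image) / len(per_image)
    ciou = total_inter / total_union if total_union else 1.0
    return giou, ciou

# Two toy images with 4-pixel masks.
preds = [[1, 1, 0, 0], [1, 0, 0, 0]]
gts = [[1, 0, 0, 0], [1, 1, 1, 1]]
giou, ciou = giou_ciou(preds, gts)
print(giou)  # 0.375 (mean of 0.5 and 0.25)
print(ciou)  # about 0.333 (2 intersecting pixels over 6 union pixels)
```

The two metrics diverge because cIoU weights large objects more heavily, while gIoU treats every image equally.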
