
UGround: Towards Unified Visual Grounding with Unrolled Transformers

About

We present UGround, a Unified visual Grounding paradigm that dynamically selects intermediate layers across Unrolled transformers as "mask as prompt", diverging from the prevailing pipeline that leverages the fixed last hidden layer as "<SEG> as prompt". UGround addresses two primary challenges posed by the prevailing paradigm: (1) its reliance on the fixed last hidden layer, which sequentially amplifies cumulative errors arising from layer-by-layer propagation without intermediate correction, and (2) its use of <SEG> as a prompt, which implicitly projects textual embeddings into visual space without explicit spatial cues (e.g., coordinates). Central to UGround is Policy-Prompted Masking, which comprises two key components: Stochastic Skip Connection (SSC) and Mask as Prompt (MasP). SSC is a reinforcement learning policy that, via stochastic sampling, allows each <SEG> token to slide across unrolled transformer layers, enabling dynamic selection of the layer at which it connects to the vision model (e.g., SAM) in a skip-connection fashion. Given the selected hidden layer, MasP uses the similarity map derived from the <SEG> token and the image tokens as a soft logit mask to prompt SAM for mask generation, offering explicit spatial cues through its activation regions. To validate the effectiveness of UGround, we unify, for the first time, visual grounding within a single framework from an attribute perspective, spanning traditional referring expression segmentation to the newly proposed reasoning segmentation, single-target to multi-target, and positive queries to false premises (empty targets). All code and models are publicly available at https://github.com/rui-qian/UGround.
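To make the Policy-Prompted Masking recipe concrete, here is a minimal PyTorch sketch of the two components described above. All shapes, the `PolicyPromptedMasking` class, its linear policy head, and the toy tensors are illustrative assumptions rather than the released implementation; see the repository above for the actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyPromptedMasking(nn.Module):
    """Hypothetical sketch of Policy-Prompted Masking (SSC + MasP)."""

    def __init__(self, dim: int):
        super().__init__()
        # Tiny policy head that scores each unrolled layer's <SEG> embedding.
        self.policy_head = nn.Linear(dim, 1)

    def forward(self, hidden_states, seg_index, image_tokens, hw):
        """
        hidden_states: list of length L with tensors of shape (seq_len, dim),
                       one per unrolled transformer layer.
        seg_index:     position of the <SEG> token in the sequence.
        image_tokens:  vision-encoder features, shape (h * w, dim).
        hw:            (h, w) spatial size of the visual feature map.
        """
        # --- Stochastic Skip Connection (SSC) ---
        # Gather the <SEG> embedding from every unrolled layer and sample one
        # layer from a categorical policy (trainable with a REINFORCE-style
        # policy-gradient update), instead of always taking the last layer.
        seg_per_layer = torch.stack([h[seg_index] for h in hidden_states])  # (L, dim)
        layer_logits = self.policy_head(seg_per_layer).squeeze(-1)          # (L,)
        dist = torch.distributions.Categorical(logits=layer_logits)
        layer = dist.sample()             # stochastically selected layer index
        log_prob = dist.log_prob(layer)   # kept for the policy-gradient loss
        seg_embed = seg_per_layer[layer]  # (dim,)

        # --- Mask as Prompt (MasP) ---
        # Similarity between the selected <SEG> embedding and the image tokens
        # gives a soft logit mask whose activation regions serve as an explicit
        # spatial prompt for the mask decoder (e.g., SAM).
        sim = F.normalize(image_tokens, dim=-1) @ F.normalize(seg_embed, dim=-1)  # (h*w,)
        h, w = hw
        soft_mask = sim.view(1, 1, h, w)  # spatial logit map used to prompt SAM
        return soft_mask, log_prob


# Toy usage with random tensors (assumes <SEG> sits at the end of the sequence).
dim, L, seq_len, h, w = 256, 8, 32, 16, 16
ppm = PolicyPromptedMasking(dim)
hidden = [torch.randn(seq_len, dim) for _ in range(L)]
image_tokens = torch.randn(h * w, dim)
mask, logp = ppm(hidden, seg_index=seq_len - 1, image_tokens=image_tokens, hw=(h, w))
print(mask.shape)  # torch.Size([1, 1, 16, 16])
```

The key design point this sketch illustrates is that the layer index is sampled rather than fixed, so the <SEG> token can attach to whichever unrolled layer the learned policy prefers, and the resulting similarity map carries explicit spatial activations into the mask decoder.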

Rui Qian, Xin Yin, Chuanhang Deng, Zhiyuan Peng, Jian Xiong, Wei Zhai, Dejing Dou • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Referring Expression Segmentation | RefCOCO (testA) | cIoU 83.5 | 217 |
| GUI Grounding | ScreenSpot v2 | Avg Accuracy 76.3 | 203 |
| Referring Expression Segmentation | RefCOCO+ (val) | cIoU 72.8 | 201 |
| Referring Expression Segmentation | RefCOCO (testB) | cIoU 77.7 | 191 |
| Referring Expression Segmentation | RefCOCO (val) | cIoU 80.6 | 190 |
| Referring Expression Segmentation | RefCOCO+ (testA) | cIoU 77.5 | 190 |
| Referring Expression Segmentation | RefCOCO+ (testB) | cIoU 65.6 | 188 |
| Reasoning Segmentation | ReasonSeg (val) | cIoU 74.9 | 145 |
| Generalized Referring Expression Segmentation | gRefCOCO (testA) | cIoU 70.87 | 115 |
| Reasoning Segmentation | ReasonSeg (test) | -- | 102 |
Showing 10 of 20 rows.
