Blind to Position, Biased in Language: Probing Mid-Layer Representational Bias in Vision-Language Encoders for Zero-Shot Language-Grounded Spatial Understanding

About

Vision-Language Encoders (VLEs) are widely adopted as the backbone of zero-shot referring image segmentation (RIS), enabling text-guided localization without task-specific training. However, prior work has underexplored the underlying biases within mid-layer representations that preserve positional and language-specific information. Through a layer-wise investigation, we reveal that the conventionally used final-layer multimodal embeddings prioritize global semantic alignment, leading to two coupled consequences. First, vision embeddings exhibit weak sensitivity to positional cues. Second, multilingual text embeddings exhibit language-dependent geometric shifts within the shared space. Motivated by these findings, we identify an underexplored pathway within VLE mid-layers for constructing a spatial map, which improves zero-shot RIS by 1-7 mIoU on nine RefCOCO benchmarks. Furthermore, leveraging mixed-language mid-layer embeddings yields higher spatial grounding accuracy (+7-8 mIoU and IoU@50), albeit at increased inference cost, and also improves performance on the zero-shot text-to-image retrieval task. Our work opens up discussion of how probing representational bias in VLEs can enhance spatial grounding.

Na Min An, Inha Kang, Minhyun Lee, Hyunjung Shim • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Referring Image Segmentation | RefCOCO (val) | mIoU 51.26 | 259 |
| Referring Expression Segmentation | RefCOCO (testA) | -- | 257 |
| Referring Image Segmentation | RefCOCO+ (test-B) | mIoU 38.91 | 252 |
| Referring Image Segmentation | RefCOCO (test A) | mIoU 57.44 | 230 |
| Referring Expression Segmentation | RefCOCO+ (testA) | -- | 230 |
| Referring Expression Segmentation | RefCOCO+ (val) | -- | 223 |
| Referring Expression Segmentation | RefCOCO (testB) | -- | 213 |
| Referring Expression Segmentation | RefCOCO (val) | -- | 212 |
| Referring Expression Segmentation | RefCOCO+ (testB) | -- | 210 |
| Referring Image Segmentation | RefCOCO+ (val) | mIoU 47.38 | 179 |
Showing 10 of 21 rows
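
The results above are reported in mIoU, and the abstract also cites IoU@50. As a rough illustration of how such scores are commonly computed for referring image segmentation, here is a minimal sketch. It rests on assumptions that this page does not confirm: mIoU is taken as the mean of per-sample mask IoUs, IoU@50 as the percentage of samples whose IoU is at least 0.5, and two empty masks are scored as a perfect match. The helper names `mask_iou` and `summarize` and the toy masks are hypothetical and are not taken from the paper's code.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask given as rows of 0/1 values


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two same-shaped binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumption: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def summarize(ious: List[float], threshold: float = 0.5) -> Tuple[float, float]:
    """Return (mIoU, IoU@threshold), both as percentages."""
    if not ious:
        raise ValueError("need at least one IoU value")
    miou = 100.0 * sum(ious) / len(ious)
    # Assumption: a sample counts as a hit when its IoU is >= threshold.
    hit_rate = 100.0 * sum(iou >= threshold for iou in ious) / len(ious)
    return miou, hit_rate


# Toy (prediction, ground truth) pairs on a 2x2 grid, for illustration only.
pairs = [
    ([[1, 1], [0, 0]], [[1, 1], [0, 0]]),  # identical masks
    ([[1, 1], [0, 0]], [[1, 0], [1, 0]]),  # one shared pixel out of three
    ([[1, 0], [0, 0]], [[0, 0], [0, 1]]),  # no overlap
    ([[1, 1], [1, 0]], [[1, 1], [0, 0]]),  # two shared pixels out of three
]
ious = [mask_iou(pred, target) for pred, target in pairs]
miou, iou_at_50 = summarize(ious)
print(f"mIoU = {miou:.2f}, IoU@50 = {iou_at_50:.2f}")
```

Benchmarks differ on details such as whether the threshold comparison is strict and whether a cumulative (overall) IoU is reported instead of a per-sample mean, so scores from this sketch are not directly comparable to the table without checking the evaluation protocol.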
