Blind to Position, Biased in Language: Probing Mid-Layer Representational Bias in Vision-Language Encoders for Zero-Shot Language-Grounded Spatial Understanding
About
Vision-Language Encoders (VLEs) are widely adopted as the backbone of zero-shot referring image segmentation (RIS), enabling text-guided localization without task-specific training. However, prior work has underexplored the biases within mid-layer representations that preserve positional and language-specific information. Through a layer-wise investigation, we reveal that the conventionally used final-layer multimodal embeddings prioritize global semantic alignment, with two coupled consequences. First, vision embeddings exhibit weak sensitivity to positional cues. Second, multilingual text embeddings exhibit language-dependent geometric shifts within the shared space. Motivated by these findings, we identify an underexplored pathway within VLE mid-layers for constructing a spatial map, which improves zero-shot RIS by 1-7 mIoU on nine RefCOCO benchmarks. Furthermore, leveraging mixed-language mid-layer embeddings enhances spatial grounding accuracy (+7-8 mIoU and IoU@50), albeit at increased inference cost, and also improves performance on the zero-shot text-to-image retrieval task. Our work opens a discussion on how probing representational bias in VLEs can enhance spatial grounding.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Referring Image Segmentation | RefCOCO (val) | mIoU 51.26 | 259 |
| Referring Expression Segmentation | RefCOCO (testA) | -- | 257 |
| Referring Image Segmentation | RefCOCO+ (test-B) | mIoU 38.91 | 252 |
| Referring Image Segmentation | RefCOCO (test A) | mIoU 57.44 | 230 |
| Referring Expression Segmentation | RefCOCO+ (testA) | -- | 230 |
| Referring Expression Segmentation | RefCOCO+ (val) | -- | 223 |
| Referring Expression Segmentation | RefCOCO (testB) | -- | 213 |
| Referring Expression Segmentation | RefCOCO (val) | -- | 212 |
| Referring Expression Segmentation | RefCOCO+ (testB) | -- | 210 |
| Referring Image Segmentation | RefCOCO+ (val) | mIoU 47.38 | 179 |
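
For readers unfamiliar with the metrics reported here, the following is a minimal sketch of how mIoU and IoU@50 are commonly computed for referring image segmentation. It assumes mIoU is the mean of per-sample mask IoU and IoU@50 is the share of samples whose IoU is at least 0.5; the paper's exact evaluation protocol is not stated on this page and may differ (for example, in how the threshold boundary or empty masks are handled). The names `mask_iou` and `summarize` are hypothetical and do not come from the authors' code.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1 values


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union of two same-sized binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def summarize(
    preds: Sequence[Mask], targets: Sequence[Mask], threshold: float = 0.5
) -> Tuple[float, float]:
    """Return (mIoU, IoU@threshold), both as percentages."""
    ious: List[float] = [mask_iou(p, t) for p, t in zip(preds, targets)]
    if not ious:
        raise ValueError("at least one prediction/target pair is required")
    miou = 100.0 * sum(ious) / len(ious)
    # Assumed convention: a sample counts as a hit when IoU >= threshold.
    hit_rate = 100.0 * sum(iou >= threshold for iou in ious) / len(ious)
    return miou, hit_rate


# Toy example with two 2x2 masks (illustrative values, not from the paper).
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
targets = [[[1, 0], [0, 0]], [[0, 1], [0, 0]]]
miou, iou_at_50 = summarize(preds, targets)
print(f"mIoU={miou:.2f}  IoU@50={iou_at_50:.2f}")
```

Under this reading, a gain of 1-7 mIoU means the average per-sample overlap rises by 1 to 7 percentage points.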