AMLRIS: Alignment-aware Masked Learning for Referring Image Segmentation

About

Referring Image Segmentation (RIS) aims to segment the object in an image that a natural language expression uniquely refers to. However, RIS training data often contain hard-to-align, instance-specific visual signals; optimizing on such pixels injects misleading gradients and drives the model in the wrong direction. By explicitly estimating pixel-level vision-language alignment, the learner can suppress low-alignment regions, concentrate on reliable cues, and acquire more generalizable alignment features. In this paper, we propose Alignment-Aware Masked Learning (AML), a simple yet effective training strategy that quantifies region-referent alignment (PMME) and filters out unreliable pixels during optimization (AFM). Specifically, for each sample AML first computes a similarity map between visual and textual features, and then masks out pixels whose similarity falls below an adaptive threshold, thereby excluding poorly aligned regions from training. AML requires no architectural changes and incurs no inference overhead; it directs attention to the areas aligned with the textual description. Experiments on the RefCOCO, RefCOCO+, and RefCOCOg datasets show that AML achieves state-of-the-art results across all 8 splits. Beyond improving RIS performance, AML also enhances the model's robustness to diverse descriptions and scenarios. Code is available at https://github.com/pipashu1/AMLRIS.
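
The two steps named in the abstract (a per-sample similarity map, then masking below an adaptive threshold) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the use of cosine similarity, the per-sample mean as the "adaptive" threshold, and applying the mask to a binary cross-entropy loss are all assumptions made for illustration. The linked repository is the reference for how PMME and AFM are actually defined.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def similarity_map(pixel_feats, text_feat):
    """One similarity score per pixel; pixel_feats is a flat list of vectors."""
    return [cosine(f, text_feat) for f in pixel_feats]


def alignment_mask(sim, offset=0.0):
    """Keep pixels at or above the per-sample mean similarity plus an offset.

    The mean-based rule is a hypothetical stand-in for the paper's
    adaptive threshold, which the abstract does not specify.
    """
    threshold = sum(sim) / len(sim) + offset
    return [1 if s >= threshold else 0 for s in sim]


def masked_bce(probs, targets, mask, eps=1e-7):
    """Binary cross-entropy averaged over the kept (mask == 1) pixels only."""
    total, kept = 0.0, 0
    for p, t, m in zip(probs, targets, mask):
        if m:
            p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
            total -= t * math.log(p) + (1 - t) * math.log(1 - p)
            kept += 1
    return total / kept if kept else 0.0


# Toy sample: four pixels with 2-D features and one text embedding.
text_feat = [1.0, 0.0]
pixel_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
probs = [0.9, 0.8, 0.9, 0.1]   # predicted foreground probabilities
targets = [1, 1, 0, 1]         # ground-truth mask labels

mask = alignment_mask(similarity_map(pixel_feats, text_feat))
loss = masked_bce(probs, targets, mask)
```

In this toy sample the last two pixels are poorly aligned with the text embedding, so they are dropped from the loss and contribute no gradient. Because the mask only changes which pixels the training loss sees, a scheme like this leaves the network and the inference path untouched, which matches the abstract's claim of no architectural change and no inference overhead.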

Tongfei Chen, Shuo Yang, Yuguang Yang, Linlin Yang, Runtang Guo, Changbai Li, He Long, Chunyu Xie, Dawei Leng, Baochang Zhang • 2026

Related benchmarks

Task	Dataset	Metric	Value
Referring Image Segmentation	RefCOCO (val)	mIoU	77.89	274
Referring Image Segmentation	RefCOCO+ (testB)	mIoU	64.61	267
Referring Image Segmentation	RefCOCO (testA)	mIoU	79.53	245
Referring Image Segmentation	RefCOCO+ (val)	mIoU	71.33	194
Referring Image Segmentation	RefCOCO (testB)	mIoU	74.99	186
Referring Image Segmentation	RefCOCOg (val)	oIoU	68.84	114
Referring Image Segmentation	RefCOCO+ (testA)	mIoU	75.61	112
Referring Image Segmentation	RefCOCOg (test)	oIoU	70.01	75

