SIGMA: Semantic-Difference Instruction-Grounding Mask Annotator for Text-Driven Image Manipulation Localization

About

Text-driven image editing has advanced rapidly, but reliably localizing these manipulations requires image manipulation localization (IML) models trained on large pixel-annotated datasets, and there is still no low-cost way to obtain such training data at scale. We observe that these data already exist in disguise: public editing datasets contain millions of (original, edited) pairs that are structurally identical to IML training samples and lack only pixel-level masks. Recovering these masks automatically is non-trivial: pixel differencing is overwhelmed by diffusion-induced perturbations across all pixels, and instruction-only grounding localizes only what the prompt describes, missing unintended editor side-effects. We propose SIGMA (Semantic-difference Instruction-Grounding Mask Annotator), which performs semantic-feature differencing in a vision foundation backbone and injects an instruction-derived spatial prior into this visual stream via bidirectional cross-modal refinement, amplifying the difference signal at intended-edit regions when the editor faithfully realizes user intent. SIGMA is trained in two complementary stages: Stage I supervises on inpainting masks; Stage II closes the diffusion-domain shift via VAE-roundtrip noise calibration, EMA self-training, and an edit-noise disentanglement loss. SIGMA outperforms existing automatic mask generators on five benchmarks (+12.20% F1, +11.16% IoU). When applied to public editing corpora, it produces a ~1.1M IML training set that improves six diverse detectors by +18.34% F1 across five datasets, turning previously unused editing data into a model-agnostic supervisory resource for IML. We'll release the full codebase as soon as the paper is accepted.
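To make the idea of differencing in feature space rather than pixel space concrete, here is a minimal NumPy sketch. It is not SIGMA's implementation: the feature maps are random stand-ins for backbone outputs, and the function names, the cosine-distance measure, and the 0.5 threshold are all assumptions chosen for illustration. It shows only that a small perturbation spread over every location barely changes feature direction, while a localized change does.

```python
import numpy as np

def feature_difference_map(feat_a, feat_b, eps=1e-8):
    """Per-location cosine distance between two (C, H, W) feature maps.

    Hypothetical helper: stands in for differencing backbone features.
    """
    a = feat_a / (np.linalg.norm(feat_a, axis=0, keepdims=True) + eps)
    b = feat_b / (np.linalg.norm(feat_b, axis=0, keepdims=True) + eps)
    return 1.0 - np.sum(a * b, axis=0)  # shape (H, W), values in [0, 2]

def difference_mask(feat_a, feat_b, threshold=0.5):
    """Binary mask of locations whose features changed direction noticeably.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return feature_difference_map(feat_a, feat_b) > threshold

# Toy data: random arrays used in place of real backbone features.
rng = np.random.default_rng(0)
original = rng.normal(size=(16, 8, 8))

# A weak perturbation at every location (a crude proxy for global editor noise).
edited = original + 0.01 * rng.normal(size=(16, 8, 8))

# A localized "edit": features in a 2x2 patch are replaced outright.
edited[:, 2:4, 2:4] = -original[:, 2:4, 2:4]

mask = difference_mask(original, edited)
```

In this toy setup the mask marks only the 2x2 patch, because the global perturbation leaves the cosine distance near zero everywhere else. Real backbone features would not separate this cleanly, which is presumably why the method adds the instruction prior and the Stage II calibration.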

Peiyu Zhuang, Jianquan Yang, Haodong Li, Zhuoying Cai, Ruitao Xie, Jishen Zeng, Baoying Chen, Jiwu Huang, Xiaochun Cao • 2026

Related benchmarks

Task	Dataset	Result
Image Manipulation Localization	AutoSplice	F1 Score 63.42	24
Image Manipulation Localization	CocoGlide	F1 Score 0.5872	24
Image Manipulation Localization	MagicBrush	F1 Score 81.95	21
Image Manipulation Localization	DEAL-300K	F1 Score 29.99	12
Image Manipulation Localization	OpenSDI	F1 Score 27.02	12
Change Detection	AutoSplice	F1 Score 96.69	10
Change Detection	DEAL-300K	F1 Score 76.43	10
Change Detection	OpenSDI	F1 Score 91.37	10
Change Detection	CocoGlide	F1 Score 95.62	10
Change Detection	MagicBrush	F1 Score 89.03	10
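For readers unfamiliar with the metrics, the following is a minimal sketch of pixel-level F1 and IoU between a predicted and a ground-truth binary mask. The function name `mask_f1_iou` is hypothetical, and the convention of returning 1.0 when both masks are empty is an assumption; the evaluation protocol behind the numbers above (thresholding, per-image versus dataset-level averaging) is not stated here and may differ.

```python
import numpy as np

def mask_f1_iou(pred, target):
    """Pixel-level F1 and IoU for two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)

    tp = int(np.logical_and(pred, target).sum())   # predicted and truly edited
    fp = int(np.logical_and(pred, ~target).sum())  # predicted but not edited
    fn = int(np.logical_and(~pred, target).sum())  # edited but missed

    # Assumed convention: two empty masks count as a perfect match.
    f1 = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) > 0 else 1.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) > 0 else 1.0
    return f1, iou

pred = np.array([[1, 1, 0],
                 [0, 0, 0]])
target = np.array([[1, 0, 0],
                   [1, 0, 0]])
f1, iou = mask_f1_iou(pred, target)  # one TP, one FP, one FN
```

Both values are fractions in [0, 1]; multiply by 100 to express them as percentages.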
