Revisiting Image Manipulation Localization under Realistic Manipulation Scenarios
About
With large models easing the labor-intensive work of image editing, manipulations in real-world scenarios now often involve a complex process: a series of editing operations applied to create a deceptive image. However, existing image manipulation localization (IML) methods remain manipulation-process-agnostic, directly producing localization masks in a one-shot prediction paradigm without modeling the underlying editing steps. This one-shot paradigm compresses the high-dimensional compositional space into a single binary mask, inducing severe dimensional collapse that forces the model to discard essential structural cues and ultimately leads to overfitting and degraded generalization. To address this, we are the first to reformulate image manipulation localization as a conditional sequence prediction task, and we propose the RITA framework. RITA predicts manipulated regions layer by layer in an ordered manner, using each step's prediction as the condition for the next, thereby explicitly modeling temporal dependencies and hierarchical structures among editing operations. To enable training and evaluation, we synthesize multi-step manipulation data and construct a new benchmark, HSIM. We further propose the HSS metric to assess sequential order and hierarchical alignment. Extensive experiments show that: 1) RITA achieves state-of-the-art generalization and robustness on traditional benchmarks; 2) it remains computationally efficient despite explicitly modeling multi-step sequences; and 3) it establishes a viable foundation for hierarchical, process-aware manipulation localization. Code and dataset are available at https://github.com/scu-zjz/RITA.
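
The layer-by-layer idea described above can be illustrated with a minimal sketch. This is not RITA's implementation: `predict_layers`, `toy_step`, `merge_masks`, the stopping rule (stop on an empty mask), and the toy image encoding are all assumptions made for illustration. The only point is the control flow, in which each step receives the masks predicted so far as its condition.

```python
from typing import Callable, List, Optional

Mask = List[List[int]]  # binary mask: 1 marks a manipulated pixel


def predict_layers(
    image: List[List[int]],
    step_fn: Callable[[List[List[int]], List[Mask]], Optional[Mask]],
    max_steps: int,
) -> List[Mask]:
    """Predict one mask per editing layer, conditioning each step on earlier masks.

    step_fn is a hypothetical stand-in for a learned predictor. Stopping on an
    empty (or None) mask is an assumed convention, not taken from the paper.
    """
    masks: List[Mask] = []
    for _ in range(max_steps):
        next_mask = step_fn(image, masks)  # condition on all previous predictions
        if next_mask is None or not any(any(row) for row in next_mask):
            break
        masks.append(next_mask)
    return masks


def toy_step(image: List[List[int]], previous: List[Mask]) -> Mask:
    """Toy predictor: pixel value k means 'edited at step k' (illustrative only)."""
    k = len(previous) + 1
    return [[1 if value == k else 0 for value in row] for row in image]


def merge_masks(masks: List[Mask], height: int, width: int) -> Mask:
    """Collapse per-layer masks into one binary mask, as one-shot benchmarks expect."""
    return [
        [int(any(m[r][c] for m in masks)) for c in range(width)]
        for r in range(height)
    ]


image = [[0, 1, 1], [2, 2, 0], [0, 0, 0]]
layers = predict_layers(image, toy_step, max_steps=5)
merged = merge_masks(layers, height=3, width=3)
```

Here `layers` holds two ordered masks (one per toy editing step), while `merged` is the single binary mask that a one-shot method would be asked to output directly.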
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Image Manipulation Localization | CAT-Net evaluation protocol (test) | Mean Binary F1 | 64.3 | 84 |
| Image Manipulation Localization | Coverage | F1 Score | 56.6 | 49 |
| Image Manipulation Localization | CAT-Net (test) | Mean Binary F1 | 64.3 | 42 |
| Image Manipulation Localization | Columbia | F1 Score | 92.1 | 42 |
| Image Manipulation Localization | CASIA v1 | F1 Score | 77 | 36 |
| Image Manipulation Localization | CocoGlide | F1 Score | 53.3 | 12 |
| Image Manipulation Localization | AutoSplice | F1 Score | 66.4 | 12 |
| Image Manipulation Localization | IMD 2020 | F1 Score | 37.9 | 6 |
| Image Manipulation Localization | HSIM (test) | Parameters (M) | 55.567 | 6 |
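
For readers unfamiliar with the F1 columns, the sketch below shows how a pixel-level binary F1 between a predicted mask and a ground-truth mask is commonly computed. The exact protocol behind the numbers in the table (thresholding, per-image versus per-dataset averaging) is not stated here, so the function names, the simple per-image mean, and the convention of returning 1.0 when both masks are empty are assumptions.

```python
from typing import List

Mask = List[List[int]]  # binary mask: 1 marks a manipulated pixel


def binary_f1(pred: Mask, target: Mask) -> float:
    """Pixel-level F1 = 2*TP / (2*TP + FP + FN) for two same-sized binary masks."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    denom = 2 * tp + fp + fn
    # Assumed convention: two empty masks count as a perfect match.
    return 2 * tp / denom if denom else 1.0


def mean_binary_f1(preds: List[Mask], targets: List[Mask]) -> float:
    """Unweighted mean of per-image F1 scores (one possible averaging choice)."""
    scores = [binary_f1(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


pred = [[1, 1], [0, 0]]
target = [[1, 0], [1, 0]]
score = binary_f1(pred, target)  # TP=1, FP=1, FN=1
```

With one true positive, one false positive, and one false negative, `score` is 0.5; benchmark tables such as the one above typically report this quantity scaled to a percentage.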