SpatialReward: Bridging the Perception Gap in Online RL for Image Editing via Explicit Spatial Reasoning
About
Online Reinforcement Learning (RL) offers a promising avenue for complex image editing but is currently constrained by the scarcity of reliable and fine-grained reward signals. Existing evaluators frequently struggle with a critical perception gap we term "Attention Collapse," where models neglect cross-image comparisons and fail to capture fine-grained details, resulting in inaccurate perception and miscalibrated scores. To address these limitations, we propose SpatialReward, a reward model that enforces precise verification via explicit spatial reasoning. By anchoring reasoning to predicted edit regions, SpatialReward grounds semantic judgments in pixel-level evidence, significantly enhancing evaluative accuracy. Trained on a curated 260k spatial-aware dataset, our model achieves state-of-the-art performance on MMRB2 and EditReward-Bench, and outperforms proprietary evaluators on our proposed MultiEditReward-Bench. Furthermore, SpatialReward serves as a robust signal in online RL, boosting OmniGen2 by +0.90 on GEdit-Bench, surpassing the leading discriminative model and doubling the gain of GPT-4.1 (+0.45). These results demonstrate that spatial reasoning is essential for unlocking effective alignment in image editing.
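To make the "anchoring reasoning to predicted edit regions" idea concrete, here is a minimal, hypothetical sketch of a spatially anchored reward. It is not the paper's actual model: it simply combines a semantic judgment score with a pixel-level locality term that rewards change inside a predicted edit-region mask and penalizes unintended change outside it. The function name, the 0.5/0.5 weighting, and the locality formula are all illustrative assumptions.

```python
import numpy as np

def spatially_anchored_reward(src, edited, mask, semantic_score):
    """Hypothetical sketch: ground a semantic judgment in pixel-level
    evidence from a predicted edit region (not SpatialReward's method).

    src, edited : float arrays of the same shape (pixel values)
    mask        : boolean array, True where the edit was predicted to occur
    semantic_score : evaluator's semantic judgment in [0, 1]
    """
    mask = mask.astype(bool)
    # Mean absolute change inside the predicted edit region
    # (should be large if the requested edit was actually applied).
    inside = np.abs(edited[mask] - src[mask]).mean() if mask.any() else 0.0
    # Mean absolute change outside the region
    # (should be small; large values indicate collateral edits).
    outside = np.abs(edited[~mask] - src[~mask]).mean() if (~mask).any() else 0.0
    # Locality term in [0, 1]: fraction of change concentrated in-region.
    locality = inside / (inside + outside + 1e-8)
    # Assumed equal weighting of semantic and spatial evidence.
    return 0.5 * semantic_score + 0.5 * locality
```

For example, an edit that changes pixels only inside the predicted region with a perfect semantic score yields a reward near 1.0, while the same semantic score with heavy off-region change is pulled down by the locality term.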
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Reward Modeling | EditReward-Bench | PF | 68.3 | 17 |
| Complex Multi-constraint Reasoning | MER-Bench Complex | 2-P Accuracy | 78 | 8 |
| Image Editing Evaluation | MMRB2 ImgEdit | Single Score | 67.1 | 8 |
| Image Editing | GEdit-Bench-EN official (test) | SC | 7.64 | 6 |
| Image Editing | ImgEdit-Bench official (test) | Overall Score | 3.72 | 6 |