
I2E: From Image Pixels to Actionable Interactive Environments for Text-Guided Image Editing

About

Existing text-guided image editing methods rely primarily on an end-to-end, pixel-level inpainting paradigm. Despite its success in simple scenarios, this paradigm struggles with compositional editing tasks that require precise local control and complex multi-object spatial reasoning. It is limited by 1) the implicit coupling of planning and execution, 2) the lack of object-level control granularity, and 3) the reliance on unstructured, pixel-centric modeling. To address these limitations, we propose I2E, a novel "Decompose-then-Action" paradigm that recasts image editing as an actionable interaction process within a structured environment. I2E uses a Decomposer to transform unstructured images into discrete, manipulable object layers, and then introduces a physics-aware Vision-Language-Action Agent that parses complex instructions into a sequence of atomic actions via Chain-of-Thought reasoning. We further construct I2E-Bench, a benchmark designed for multi-instance spatial reasoning and high-precision editing. Experimental results on I2E-Bench and multiple public benchmarks demonstrate that I2E significantly outperforms state-of-the-art methods in handling complex compositional instructions, maintaining physical plausibility, and ensuring multi-turn editing stability.
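The "Decompose-then-Action" pipeline described above can be sketched in miniature: a stand-in Decomposer turns detected objects into manipulable layers, and an agent-produced action sequence is executed step by step over those layers. All names here (`Layer`, `decompose`, `execute`, the `move`/`remove` actions) are illustrative assumptions, not the paper's actual API, and real object layers would carry pixels and masks rather than coordinates alone.

```python
from dataclasses import dataclass

# Hypothetical object layer from the Decomposer: an object identity plus
# its 2D position in the scene (illustrative only; the real system would
# also hold the object's pixels and mask).
@dataclass
class Layer:
    name: str
    x: int
    y: int

def decompose(objects):
    """Stand-in Decomposer: turn detected objects into a list of layers."""
    return [Layer(name, x, y) for name, x, y in objects]

# Atomic actions the agent might emit after Chain-of-Thought parsing.
def move(layers, name, dx, dy):
    for layer in layers:
        if layer.name == name:
            layer.x += dx
            layer.y += dy
    return layers

def remove(layers, name):
    return [l for l in layers if l.name != name]

def execute(layers, actions):
    """Apply a parsed atomic-action sequence to the layered scene."""
    ops = {"move": move, "remove": remove}
    for op, *args in actions:
        layers = ops[op](layers, *args)
    return layers

layers = decompose([("cat", 10, 20), ("vase", 50, 60)])
# e.g. instruction: "move the cat 30px to the right and delete the vase"
layers = execute(layers, [("move", "cat", 30, 0), ("remove", "vase")])
print([(l.name, l.x, l.y) for l in layers])  # [('cat', 40, 20)]
```

The point of the split is that planning (producing the action list) and execution (applying it to layers) are decoupled, so each edit is a discrete, reversible operation on one object rather than a global repaint.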

Jinghan Yu, Junhao Xiao, Chenyu Zhu, Jiaming Li, Jia Li, HanMing Deng, Xirui Wang, Guoli Jia, Jianjun Li, Zhiyuan Ma, Xiang Bai, Bowen Zhou • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Image Editing | I2E-BENCH | Average Score | 7.42 | 6
Instruction-guided image editing | MagicBrush | LPIPS-U | 0.0446 | 6
Instruction-guided image editing | EmuEdit | LPIPS-U | 0.0565 | 6
Instruction-guided image editing | I2E-BENCH 1.0 (test) | LPIPS-U | 0.0754 | 6
