HierEdit: Region-Aware Hierarchical Diffusion for Efficient High-Resolution Editing
About
High-resolution image editing is essential for professional and creative applications, yet existing multimodal diffusion-based editors remain computationally inefficient and constrained to relatively low resolutions. Current approaches either redundantly process the entire image canvas or rely on large-scale high-resolution datasets, resulting in substantial training and inference costs. We introduce HierEdit, a region-aware hierarchical diffusion framework designed for efficient and scalable high-resolution image editing. Our method first performs the edit on a low-resolution proxy using an off-the-shelf editing model, which yields a reference result and localizes the modified regions. A hierarchical local-window diffusion model (Local-Window MMDiT) then refines only the edited regions within the original high-resolution image, reusing the unaltered regions as conditioning inputs. The low-resolution proxy further provides structural guidance and intermediate denoising supervision (Inference Acceleration), ensuring consistent global semantics and stable generation without full-resolution attention computation. This targeted, hierarchical design enables fast, high-fidelity editing of images up to 4K resolution without any specialized high-resolution training data. Extensive experiments demonstrate that HierEdit achieves competitive visual quality on commodity-resolution datasets while significantly accelerating inference and extending seamlessly to ultra-high-resolution 4K editing. Project page: https://peteryyzhang.github.io/HierEdit-page/
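The region-localization step described above can be illustrated with a minimal sketch. It assumes that edited regions are found by thresholding the per-pixel difference between the low-resolution original and the edited proxy, and that the image is tiled into fixed-size windows. The abstract does not specify how HierEdit localizes edits, so the function names (`edited_windows`, `edit_ratio`), the threshold, and the window size are all hypothetical choices made for illustration.

```python
import numpy as np


def edited_windows(orig_lr, edit_lr, scale, window=8, thresh=0.05):
    """Return high-res boxes (top, left, bottom, right) for changed windows.

    Assumption: an edit is detected by thresholding the mean absolute
    difference between the low-res original and the low-res edited proxy.
    This is an illustrative stand-in, not the paper's localization method.
    """
    # Per-pixel change magnitude, averaged over colour channels.
    diff = np.abs(edit_lr.astype(np.float32) - orig_lr.astype(np.float32))
    mask = diff.mean(axis=-1) > thresh
    h, w = mask.shape

    boxes = []
    for top in range(0, h, window):
        for left in range(0, w, window):
            # Keep a window only if at least one pixel inside it changed.
            if mask[top:top + window, left:left + window].any():
                bottom = min(top + window, h)
                right = min(left + window, w)
                # Map low-res window coordinates onto the high-res canvas.
                boxes.append((top * scale, left * scale,
                              bottom * scale, right * scale))
    return boxes


def edit_ratio(boxes, hr_shape):
    """Fraction of the high-res canvas covered by the selected windows."""
    area = sum((b - t) * (r - l) for t, l, b, r in boxes)
    return area / (hr_shape[0] * hr_shape[1])


if __name__ == "__main__":
    # Toy example: a 32x32 proxy of a 128x128 image (scale factor 4),
    # with an 8x8 patch changed in the proxy.
    orig = np.zeros((32, 32, 3), dtype=np.float32)
    edit = orig.copy()
    edit[4:12, 4:12] = 1.0

    boxes = edited_windows(orig, edit, scale=4)
    print(len(boxes), edit_ratio(boxes, (128, 128)))
```

In this toy setup only the selected windows would be passed to a refinement model, while the remaining area is left untouched and could serve as conditioning context. The fraction returned by `edit_ratio` corresponds loosely to the "edit ratio" used in the latency benchmarks, though the benchmark's exact definition is not given here.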
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Image Inpainting | High-resolution Image Editing 25% Edit Ratio | Latency (s) | 4.51 | 18 |
| Image Inpainting | High-resolution Image Editing 50% Edit Ratio | Latency (s) | 6.74 | 18 |
| Image Inpainting | High-resolution Image Editing 75% Edit Ratio | Latency (s) | 8.32 | 18 |
| Instruction-guided image editing | EmuEdit | DINO Score | 0.833 | 10 |
| Text-guided inpainting | 1K x 1K resolution dataset | FID | 39.5 | 5 |
| Text-guided Editing | I2EBench | SSIM | 0.508 | 4 |
| Text-guided Editing | CompBench | CLIP Score | 20.6 | 4 |
| Text-guided Editing | ImgEdit | Composite Score | 3.51 | 4 |
| Image-guided inpainting | 1K x 1K resolution dataset | FID | 41.9 | 3 |