Setting the Stage: Text-Driven Scene-Consistent Image Generation

About

We focus on the foundational task of Scene Staging: given a reference scene image and a text condition specifying an actor category to be generated in the scene and its spatial relation to the scene, the goal is to synthesize an output image that preserves the same scene identity as the reference image while correctly generating the actor according to the spatial relation described in the text. Existing methods struggle with this task, largely due to the scarcity of high-quality paired data and unconstrained generation objectives. To overcome the data bottleneck, we propose a novel data construction pipeline that combines real-world photographs, entity removal, and image-to-video diffusion models to generate training pairs with diverse scenes, viewpoints and correct entity-scene relationships. Furthermore, we introduce a novel correspondence-guided attention loss that leverages cross-view cues to enforce spatial alignment with the reference scene. Experiments on our scene-consistent benchmark show that our approach achieves better scene alignment and text-image alignment than state-of-the-art baselines, according to both automatic metrics and human preference studies. Our method generates images with diverse viewpoints and compositions while faithfully following the textual instructions and preserving the reference scene identity.

Cong Xie, Che Wang, Yan Zhang, Ruiqi Yu, Han Zou, Zheng Pan, Zhenpeng Zhan• 2025

Related benchmarks

Task	Dataset	Result	Rank
Text-Guided Image Editing	Custom Scene-Prompt 2,125 scene-prompt pairs (test)	G2.5F-SA9.811		4

Showing 1 of 1 rows

Other info

Follow for update

@wizwand_team Discord