Joint Generative Modeling of Grounded Scene Graphs and Images via Diffusion Models
About
We introduce a framework for jointly generating grounded scene graphs and images, a challenging task involving high-dimensional, multi-modal structured data. To effectively model this complex joint distribution, we adopt a factorized approach: first generating a grounded scene graph, then generating an image conditioned on that grounded scene graph. While conditional image generation has been widely explored in the literature, our primary focus is on the generation of grounded scene graphs from noise, which provides efficient and interpretable control over the image generation process. This task requires generating plausible grounded scene graphs with heterogeneous attributes for both nodes (objects) and edges (relations among objects), encompassing continuous attributes (e.g., object bounding boxes) and discrete attributes (e.g., object and relation categories). To address this challenge, we introduce DiffuseSG, a novel diffusion model that jointly models the heterogeneous node and edge attributes. We explore different encoding strategies to effectively handle the categorical data. Leveraging a graph transformer as the denoiser, DiffuseSG progressively refines grounded scene graph representations in a continuous space before discretizing them to generate structured outputs. Additionally, we introduce an IoU-based regularization term to enhance empirical performance. Our model outperforms existing methods in grounded scene graph generation on the Visual Genome (VG) and COCO-Stuff datasets, excelling in both standard and newly introduced metrics that more accurately capture the task's complexity. Furthermore, we demonstrate the broader applicability of DiffuseSG in two important downstream tasks: 1) achieving superior results in a range of grounded scene graph completion tasks, and 2) enhancing grounded scene graph detection models by leveraging additional training samples generated by DiffuseSG.
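To make two ingredients mentioned in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a one-hot encoding as one possible way to place categories in a continuous space, argmax as the discretization step, and boxes given as `(x_min, y_min, x_max, y_max)`. The helper names `one_hot`, `discretize`, and `box_iou` are hypothetical, and the abstract does not specify how the IoU-based regularization term is actually defined.

```python
from typing import List, Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def one_hot(index: int, num_classes: int) -> List[float]:
    """Encode a category index as a continuous vector (one possible encoding)."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec


def discretize(vec: Sequence[float]) -> int:
    """Map a continuous (e.g. denoised) vector back to a category via argmax."""
    return max(range(len(vec)), key=lambda i: vec[i])


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Degenerate boxes with zero total area are treated as non-overlapping.
    return inter / union if union > 0 else 0.0
```

For instance, `discretize([0.1, 0.7, 0.2])` picks category 1, and two 2×2 boxes offset by one unit in each direction overlap in a 1×1 region, giving an IoU of 1/7. A regularizer built on such a quantity could compare predicted and reference boxes, though the exact form used by DiffuseSG is not stated here.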
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Relation Detection | Visual Genome (test) | R@50 | 74.4 | 13 |
| Scene Graph Generation | COCO | N-MMD | 3.93 | 7 |
| Layout Generation | Visual Genome | F1-std | 17.53 | 5 |
| Layout Generation | COCO | F1-std | 46.44 | 5 |
| Scene Graph Generation | LAION-SG | N-MMD | 1.36 | 3 |
| Single Object Completion | Visual Genome (test) | w1 Score | 10.2 | 3 |
| Single Object Completion | CompSGBench (test) | w1 Score | 8.6 | 3 |
| Single Relation Completion | CompSGBench (test) | w1 Score | 8.8 | 3 |