Joint Generative Modeling of Grounded Scene Graphs and Images via Diffusion Models

About

We introduce a framework for joint grounded scene graph - image generation, a challenging task involving high-dimensional, multi-modal structured data. To effectively model this complex joint distribution, we adopt a factorized approach: first generating a grounded scene graph, followed by image generation conditioned on the generated grounded scene graph. While conditional image generation has been widely explored in the literature, our primary focus is on the generation of grounded scene graphs from noise, which provides efficient and interpretable control over the image generation process. This task requires generating plausible grounded scene graphs with heterogeneous attributes for both nodes (objects) and edges (relations among objects), encompassing continuous attributes (e.g., object bounding boxes) and discrete attributes (e.g., object and relation categories). To address this challenge, we introduce DiffuseSG, a novel diffusion model that jointly models the heterogeneous node and edge attributes. We explore different encoding strategies to effectively handle the categorical data. Leveraging a graph transformer as the denoiser, DiffuseSG progressively refines grounded scene graph representations in a continuous space before discretizing them to generate structured outputs. Additionally, we introduce an IoU-based regularization term to enhance empirical performance. Our model outperforms existing methods in grounded scene graph generation on the VG and COCO-Stuff datasets, excelling in both standard and newly introduced metrics that more accurately capture the task's complexity. Furthermore, we demonstrate the broader applicability of DiffuseSG in two important downstream tasks: 1) achieving superior results in a range of grounded scene graph completion tasks, and 2) enhancing grounded scene graph detection models by leveraging additional training samples generated by DiffuseSG.
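The abstract describes a grounded scene graph as objects with categories and bounding boxes plus categorical relations between them, with the categorical parts handled in a continuous space and discretized afterwards. The sketch below illustrates one way such data could be represented. It is not the authors' implementation: the abstract does not say which encodings DiffuseSG uses or how its IoU-based regularization term is defined, so the one-hot encoding, the argmax discretization, and every name here (`GroundedSceneGraph`, `one_hot`, `discretize`, `box_iou`) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class GroundedSceneGraph:
    """Hypothetical container: nodes are objects, edges are relations."""
    labels: list      # object category index per node (discrete)
    boxes: list       # (x1, y1, x2, y2) per node, normalized to [0, 1] (continuous)
    relations: list   # relations[i][j] = relation category from node i to node j, 0 = none


def one_hot(index, num_classes):
    """Assumed encoding: map a category index to a continuous vector."""
    return [1.0 if k == index else 0.0 for k in range(num_classes)]


def discretize(vector):
    """Map a (possibly noisy) continuous vector back to a category by argmax."""
    return max(range(len(vector)), key=lambda k: vector[k])


def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two objects with one directed relation from node 0 to node 1.
graph = GroundedSceneGraph(
    labels=[2, 4],
    boxes=[(0.0, 0.0, 0.5, 1.0), (0.25, 0.0, 0.75, 1.0)],
    relations=[[0, 3], [0, 0]],
)

# Encode the first label, perturb it slightly, and recover the category.
noisy = [v + 0.1 for v in one_hot(graph.labels[0], num_classes=5)]
recovered = discretize(noisy)
overlap = box_iou(graph.boxes[0], graph.boxes[1])
```

Here `recovered` equals the original label because a uniform perturbation leaves the largest entry unchanged, and `overlap` measures how much the two example boxes coincide, which is the kind of quantity an IoU-based term would operate on.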

Bicheng Xu, Qi Yan, Renjie Liao, Lele Wang, Leonid Sigal • 2024

Related benchmarks

Task	Dataset	Result
Visual Relation Detection	Visual Genome (test)	R@50 74.4	13
Scene Graph Generation	COCO	N-MMD 3.93	7
Layout Generation	Visual Genome	F1-std 17.53	5
Layout Generation	COCO	F1-std 46.44	5
Scene Graph Generation	LAION-SG	N-MMD 1.36	3
Single Object Completion	Visual Genome (test)	w1 Score 10.2	3
Single Object Completion	CompSGBench (test)	w1 Score 8.6	3
Single Relation Completion	CompSGBench (test)	w1 Score 8.8	3
