Compositional Transformers for Scene Generation
About
We introduce the GANformer2 model, an iterative object-oriented transformer, explored for the task of generative modeling. The network incorporates strong and explicit structural priors to reflect the compositional nature of visual scenes, and synthesizes images through a sequential process. It operates in two stages: a fast and lightweight planning phase, where we draft a high-level scene layout, followed by an attention-based execution phase, where the layout is refined, evolving into a rich and detailed picture. Our model moves away from conventional black-box GAN architectures that feature a flat and monolithic latent space, towards a transparent design that encourages efficiency, controllability and interpretability. We demonstrate GANformer2's strengths and qualities through a careful evaluation over a range of datasets, from multi-object CLEVR scenes to the challenging COCO images, showing it successfully achieves state-of-the-art performance in terms of visual quality, diversity and consistency. Further experiments demonstrate the model's disentanglement and provide a deeper insight into its generative process, as it proceeds step-by-step from a rough initial sketch, to a detailed layout that accounts for objects' depths and dependencies, and up to the final high-resolution depiction of vibrant and intricate real-world scenes. See https://github.com/dorarad/gansformer for the model implementation.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Unconditional Image Generation | CelebA unconditional 64 x 64 | FID 6.87 | 95 |
| Unconditional Image Generation | FFHQ 256x256 | FID 7.77 | 64 |
| Image Generation | FFHQ 256x256 50k (test) | FID 7.77 | 15 |
| Unconditional Image Generation | CLEVR | FID 4.7 | 8 |
| Unconditional Image Generation | Bedrooms | FID 6.05 | 8 |
| Unconditional Image Generation | COCO | FID 21.58 | 8 |
| Unconditional Image Generation | COCOp | FID 20.41 | 8 |
| Unconditional Image Generation | FFHQ | FID 7.77 | 8 |
| Unconditional Image Generation | Cityscapes | FID 6.21 | 8 |
| Latent-space disentanglement and controllability | CLEVR | Disentanglement 0.852 | 6 |
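
Most results above are FID (Fréchet Inception Distance) scores, where lower is better. FID models the feature vectors of real and generated images as two Gaussians and measures the Fréchet distance between them. The following is a minimal sketch of that distance, assuming the features have already been extracted (in practice they come from a pretrained Inception network, which is not shown here). The function names `feature_stats` and `frechet_distance` are illustrative and are not taken from the GANformer repository.

```python
import numpy as np

def feature_stats(features):
    """Mean and covariance of an (N, D) array of feature vectors."""
    features = np.asarray(features, dtype=float)
    mu = features.mean(axis=0)
    sigma = np.atleast_2d(np.cov(features, rowvar=False))
    return mu, sigma

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1) + Tr(sigma2) - 2 Tr((sigma1 sigma2)^(1/2))
    """
    mu1, mu2 = np.asarray(mu1, dtype=float), np.asarray(mu2, dtype=float)
    sigma1 = np.asarray(sigma1, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    diff = mu1 - mu2
    # The trace of the matrix square root equals the sum of the square roots
    # of the eigenvalues of sigma1 @ sigma2, which are real and non-negative
    # for positive semi-definite covariances. Clipping guards against small
    # negative values caused by floating-point error.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)
```

Given feature arrays `real` and `fake`, the score would be `frechet_distance(*feature_stats(real), *feature_stats(fake))`. Reported FID values also depend on the feature extractor and the number of samples, so scores from this sketch are only comparable to the table when computed under the same evaluation setup.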