Object-driven Text-to-Image Synthesis via Adversarial Training
About
In this paper, we propose Object-driven Attentive Generative Adversarial Networks (Obj-GANs) that enable object-centered text-to-image synthesis for complex scenes. Following a two-step (layout-then-image) generation process, a novel object-driven attentive image generator is proposed to synthesize salient objects by attending to the most relevant words in the text description and to the pre-generated semantic layout. In addition, a new Fast R-CNN based object-wise discriminator is proposed to provide rich object-wise discrimination signals on whether each synthesized object matches the text description and the pre-generated layout. The proposed Obj-GAN significantly outperforms the previous state of the art on various metrics on the large-scale COCO benchmark, increasing the Inception score by 27% and decreasing the FID score by 11%. A thorough comparison between traditional grid attention and the new object-driven attention is provided by analyzing their mechanisms and visualizing their attention layers, offering insight into how the proposed model generates complex scenes with high quality.
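The core idea of object-driven attention is that each object region in the layout queries the caption's words, so the generator conditions each object on the words most relevant to it. A minimal sketch of this word-attention step is below; the shapes and the use of class-label embeddings as queries are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def object_driven_attention(object_queries, word_embeddings):
    """Toy sketch of object-driven word attention.

    object_queries:  (n_objects, d) array — e.g. embeddings of the object
                     class labels from the pre-generated layout (assumed).
    word_embeddings: (n_words, d) array — caption word embeddings.
    Returns:         (n_objects, d) per-object context vectors, each a
                     softmax-weighted average of the word embeddings.
    """
    scores = object_queries @ word_embeddings.T          # (n_objects, n_words)
    scores -= scores.max(axis=1, keepdims=True)          # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)        # softmax over words
    return weights @ word_embeddings                     # per-object context

rng = np.random.default_rng(0)
queries = rng.normal(size=(2, 4))   # two objects in the layout
words = rng.normal(size=(5, 4))     # five caption words
context = object_driven_attention(queries, words)
```

In contrast, grid attention would compute such a context vector for every spatial grid cell rather than per object, which is why the object-driven variant focuses discrimination and generation on salient objects.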
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Image Generation | MS-COCO | FID | 36.52 | 75 |
| Text-to-Image Synthesis | MS-COCO (val) | FID | 36.52 | 35 |
| Grounded Text-to-Image Generation | COCO 2014 (val) | FID | 20.75 | 26 |
| Text-to-Image Generation | COCO 2014 (test) | FID | 36.52 | 25 |
| Text-to-Image Generation | COCO (test) | FID | 25.85 | 18 |
| Floor plan generation | T2D (test) | Micro IoU | 15.74 | 15 |
| Text-to-Image Synthesis | MS-COCO | SOA-C | 27.14 | 13 |