Improved Transformer for High-Resolution GANs
About
Attention-based models, exemplified by the Transformer, can effectively model long-range dependencies, but the quadratic complexity of the self-attention operation makes them difficult to adopt for high-resolution image generation with Generative Adversarial Networks (GANs). In this paper, we introduce two key ingredients to the Transformer to address this challenge. First, in the low-resolution stages of the generative process, standard global self-attention is replaced with the proposed multi-axis blocked self-attention, which allows efficient mixing of local and global attention. Second, in the high-resolution stages, we drop self-attention and keep only multi-layer perceptrons, reminiscent of implicit neural functions. To further improve performance, we introduce an additional self-modulation component based on cross-attention. The resulting model, denoted HiT, has nearly linear computational complexity with respect to image size and thus scales directly to synthesizing high-definition images. Our experiments show that HiT achieves state-of-the-art FID scores of 30.83 and 2.95 on unconditional ImageNet $128 \times 128$ and FFHQ $256 \times 256$, respectively, with reasonable throughput. We believe HiT is an important milestone for GAN generators that are completely free of convolutions. Our code is publicly available at https://github.com/google-research/hit-gan
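To make the idea of mixing local and global attention over a blocked feature map more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the helper names (`block`, `unblock`, `attend`, `multi_axis_attention`, `attention_pairs`) are hypothetical, the attention has no learned projections, and splitting the channels in half stands in for assigning different heads to the two axes. It assumes the feature map is cut into non-overlapping `b x b` blocks, with one half of the channels attending within each block (local) and the other half attending across blocks at the same within-block position (a dilated, global-style pattern).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(x):
    """Dot-product self-attention over the second-to-last axis.
    Assumption: no learned query/key/value projections, for brevity."""
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ x

def block(x, b):
    """(H, W, C) -> (G, P, C): G non-overlapping blocks of P = b*b pixels."""
    H, W, C = x.shape
    x = x.reshape(H // b, b, W // b, b, C).transpose(0, 2, 1, 3, 4)
    return x.reshape((H // b) * (W // b), b * b, C)

def unblock(x, H, W, b):
    """Inverse of block: (G, P, C) -> (H, W, C)."""
    C = x.shape[-1]
    x = x.reshape(H // b, W // b, b, b, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, C)

def multi_axis_attention(x, b):
    """Hypothetical sketch: local attention on one channel half,
    cross-block (dilated) attention on the other half."""
    H, W, C = x.shape
    half = C // 2
    blocks = block(x, b)                                  # (G, P, C)
    local = attend(blocks[..., :half])                    # mixes the P pixels in a block
    dilated = attend(np.swapaxes(blocks[..., half:], 0, 1))  # (P, G, C - half): mixes blocks
    dilated = np.swapaxes(dilated, 0, 1)                  # back to (G, P, C - half)
    return unblock(np.concatenate([local, dilated], axis=-1), H, W, b)

def attention_pairs(H, W, b):
    """Number of query-key pairs: global attention vs. the two blocked axes."""
    n, P = H * W, b * b
    G = n // P
    return n * n, G * P * P + P * G * G

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 4))
y = multi_axis_attention(x, b=4)
full_pairs, blocked_pairs = attention_pairs(8, 8, b=4)
```

In this toy setting the blocked variant scores `n * (P + G)` query-key pairs instead of `n * n`, where `n = H * W`, which is where the savings over global attention on the same grid come from. How the real model chooses block sizes and head assignments is left to the paper and the released code.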
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Unconditional Image Generation | FFHQ 256x256 | FID 2.95 | 64 |
| Image Generation | CelebA-HQ (test) | FID 3.39 | 42 |
| Image Generation | ImageNet 1k (train) | FID 30.83 | 29 |
| Image Generation | FFHQ (test) | FID 2.95 | 21 |
| Unconditional image synthesis | CelebA-HQ 256 x 256 | FID 3.39 | 16 |
| Image Generation | FFHQ 256x256 50k (test) | FID 2.58 | 15 |
| Unconditional image synthesis | FFHQ 1024 | FID 6.37 | 12 |
| Unconditional Image Generation | ImageNet 128x128 (train) | FID 30.83 | 9 |
| Image Generation | FFHQ 1024x1024 50k (test) | FID 6.37 | 7 |
| Image Reconstruction | ImageNet 256x256 (test) | FID 6.37 | 5 |