Efficient Scale-Invariant Generator with Column-Row Entangled Pixel Synthesis
About
Any-scale image synthesis offers an efficient and scalable way to synthesize photo-realistic images at any scale, even beyond 2K resolution. However, existing GAN-based solutions depend excessively on convolutions and a hierarchical architecture, which introduce inconsistency and the "texture sticking" issue when the output resolution is scaled. INR-based generators, by contrast, are scale-equivariant by design, but their huge memory footprint and slow inference hinder their adoption in large-scale or real-time systems. In this work, we propose Column-Row Entangled Pixel Synthesis (CREPS), a new generative model that is both efficient and scale-equivariant without using any spatial convolutions or coarse-to-fine design. To reduce the memory footprint and make the system scalable, we employ a novel bi-line representation that decomposes layer-wise feature maps into separate "thick" column and row encodings. Experiments on various datasets, including FFHQ, LSUN-Church, MetFaces, and Flickr-Scenery, confirm CREPS' ability to synthesize scale-consistent and alias-free images at arbitrary resolutions with reasonable training and inference speed. Code is available at https://github.com/VinAIResearch/CREPS.
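To make the memory argument behind the bi-line representation concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not say how the column and row encodings are recombined; the sketch assumes a per-channel matrix product over a hypothetical "thickness" dimension `D`, and the names `compose_feature_map` and `stored_values` are illustrative only.

```python
def compose_feature_map(row_enc, col_enc):
    """Assumed recombination: per-channel matrix product over thickness D.

    row_enc[c][h][d] has shape (C, H, D); col_enc[c][d][w] has shape (C, D, W).
    Returns a nested list of shape (C, H, W).
    """
    fmap = []
    for rows, cols in zip(row_enc, col_enc):
        width = len(cols[0])
        channel = [
            [sum(r[d] * cols[d][w] for d in range(len(r))) for w in range(width)]
            for r in rows
        ]
        fmap.append(channel)
    return fmap


def stored_values(C, H, W, D):
    """Count stored values: bi-line encodings versus a dense feature map."""
    bi_line = C * D * (H + W)  # one (H, D) row block and one (D, W) column block per channel
    dense = C * H * W          # full spatial feature map per channel
    return bi_line, dense


# Tiny hand-checkable example with C=1, H=2, W=3, D=2 (all sizes hypothetical).
row_enc = [[[1.0, 0.0], [0.0, 2.0]]]
col_enc = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]]
fmap = compose_feature_map(row_enc, col_enc)
# fmap == [[[1.0, 2.0, 3.0], [8.0, 10.0, 12.0]]]

bi_line, dense = stored_values(C=8, H=64, W=64, D=4)
# bi_line == 4096, dense == 32768
```

Under this assumption, storage grows with `H + W` rather than `H * W`, which is consistent with the abstract's claim that the decomposition reduces memory at high resolutions.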
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Unconditional Image Generation | LSUN Church 256x256 | FID 5.5 | 14 |
| Unconditional image synthesis | FFHQ 1024 | FID 4.09 | 12 |
| Image Synthesis | FFHQ 1024 (test) | FID (50k) 4.09 | 9 |
| Image Synthesis | LSUN Church 256x256 (test) | FID 5.5 | 6 |
| Image Synthesis | FFHQ 512 (test) | FID 4.43 | 3 |
| Unconditional image synthesis | FFHQ 512 | FID 4.43 | 3 |
| Unconditional image synthesis | Scenery-256 | FID 7.21 | 3 |
| Unconditional image synthesis | MetFaces 1024 | FID 20.52 | 2 |