Parallel Multiscale Autoregressive Density Estimation
About
PixelCNN achieves state-of-the-art results in density estimation for natural images. Although training is fast, inference is costly, requiring one network evaluation per pixel, i.e. O(N) for N pixels. This can be sped up by caching activations, but still involves generating each pixel sequentially. In this work, we propose a parallelized PixelCNN that allows more efficient inference by modeling certain pixel groups as conditionally independent. Our new PixelCNN model achieves competitive density estimation and an orders-of-magnitude speedup (O(log N) sampling instead of O(N)), enabling the practical generation of 512x512 images. We evaluate the model on class-conditional image generation, text-to-image synthesis, and action-conditional video generation, showing that our model achieves the best results among non-pixel-autoregressive density models that allow efficient sampling.
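The O(log N) sampling cost comes from the multiscale scheme: the image is generated coarse-to-fine, and at each resolution-doubling step the new pixels are split into groups that are modeled as conditionally independent given the coarser image, so each group needs only one network evaluation. A minimal sketch of the control flow (the `group_net` here is a hypothetical stand-in that returns random pixels, not the paper's actual network):

```python
import numpy as np

def sample_image(final_size=512, base_size=4, seed=None):
    rng = np.random.default_rng(seed)

    def group_net(conditioning, out_shape):
        # Stand-in for one forward pass of the conditional PixelCNN:
        # it predicts a whole pixel group at once. Here: random pixels.
        return rng.integers(0, 256, size=out_shape, dtype=np.uint8)

    # Generate a tiny base image with a single evaluation.
    img = group_net(None, (base_size, base_size))
    evaluations = 1

    while img.shape[0] < final_size:
        s = img.shape[0] * 2
        up = np.zeros((s, s), dtype=np.uint8)
        up[::2, ::2] = img  # keep the already-generated coarse pixels
        # The three new pixel groups at this scale are conditionally
        # independent given the coarser image, so each is sampled in
        # parallel with one network evaluation.
        for dy, dx in [(0, 1), (1, 0), (1, 1)]:
            up[dy::2, dx::2] = group_net(img, (s // 2, s // 2))
            evaluations += 1
        img = up

    return img, evaluations

img, n_evals = sample_image(512, 4)
# 512 = 4 * 2^7, so 7 doublings: 1 + 3*7 = 22 evaluations
# for 512*512 = 262,144 pixels, versus 262,144 for plain PixelCNN.
```

The group decomposition (which pixels go in which group, and in what order) is a modeling choice; the sketch above uses a simple even/odd checkerboard-style split purely to illustrate the logarithmic evaluation count.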
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Density Estimation | ImageNet 32x32 (test) | Bits per Sub-pixel | 3.95 | 66 |
| Density Estimation | ImageNet 64x64 (test) | Bits per Sub-pixel | 3.7 | 62 |
| Unconditional Image Generation | ImageNet-32 | BPD | 3.95 | 31 |
| Generative Modeling | ImageNet 32x32 downsampled | Bits per Dimension | 3.95 | 24 |
| Unconditional Image Generation | ImageNet 64 | BPD | 3.7 | 22 |
| Unconditional image modeling | ImageNet 64x64 | Bits/Dim | 3.7 | 17 |
| Density Estimation | ImageNet 64 | Bits per Dimension | 3.7 | 16 |
| Density Estimation | ImageNet 64x64 (val) | Bits/Dim | 3.7 | 13 |
| Sampling | ImageNet 32x32 | Sampling Time (s) | 1.17 | 9 |
| Unconditional image modeling | ImageNet 32x32 | Bits/Dim | 3.95 | 8 |