Striving for Simplicity: The All Convolutional Net
About
Most modern convolutional neural networks (CNNs) used for object recognition are built on the same principles: alternating convolution and max-pooling layers followed by a small number of fully connected layers. We re-evaluate the state of the art for object recognition from small images with convolutional networks, questioning the necessity of different components in the pipeline. We find that max-pooling can simply be replaced by a convolutional layer with increased stride without loss in accuracy on several image recognition benchmarks. Following this finding -- and building on other recent work on finding simple network structures -- we propose a new architecture that consists solely of convolutional layers and yields competitive or state-of-the-art performance on several object recognition datasets (CIFAR-10, CIFAR-100, ImageNet). To analyze the network we introduce a new variant of the "deconvolution approach" for visualizing features learned by CNNs, which can be applied to a broader range of network structures than existing approaches.
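The core finding — that a max-pooling layer can be swapped for a convolution with the same kernel size and stride — is easy to see at the level of output shapes. The sketch below (not from the paper; plain numpy, single-channel, no padding) implements both operations naively and shows they produce spatially identical outputs, so one can stand in for the other in an architecture:

```python
import numpy as np

def max_pool2d(x, k=2, s=2):
    """Naive 2D max-pooling over a single-channel feature map."""
    h, w = x.shape
    out = np.empty(((h - k) // s + 1, (w - k) // s + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[i*s:i*s+k, j*s:j*s+k].max()
    return out

def strided_conv2d(x, w, s=2):
    """Naive strided 2D convolution (cross-correlation) with kernel w."""
    k = w.shape[0]
    h, ww = x.shape
    out = np.empty(((h - k) // s + 1, (ww - k) // s + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[i*s:i*s+k, j*s:j*s+k] * w).sum()
    return out

x = np.arange(36, dtype=float).reshape(6, 6)
pooled = max_pool2d(x, k=2, s=2)
# Averaging kernel chosen for illustration; in the paper the
# replacement convolution's weights are learned, not fixed.
conv = strided_conv2d(x, np.full((2, 2), 0.25), s=2)
print(pooled.shape, conv.shape)  # (3, 3) (3, 3)
```

Both layers downsample a 6x6 map to 3x3; the difference is that max-pooling applies a fixed, non-linear maximum while the convolution applies a learned linear filter (typically followed by a non-linearity), which is what makes the all-convolutional substitution possible.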
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | CIFAR-100 (test) | -- | 3518 |
| Image Classification | CIFAR-10 (test) | -- | 3381 |
| Image Classification | ImageNet (val) | -- | 1206 |
| Image Classification | CIFAR-10 (test) | Accuracy: 92.8% | 906 |
| Image Classification | CIFAR-100 | Accuracy: 66.29% | 691 |
| Image Classification | CIFAR-10 | Accuracy: 92.75% | 564 |
| Crowd Counting | ShanghaiTech Part A (test) | MAE: 88.2 | 271 |
| Classification | CIFAR-100 (test) | Accuracy: 66.29% | 129 |
| Crowd Counting | UCF-QNRF (test) | MAE: 147.2 | 113 |
| Explainability | ImageNet (val) | Insertion: 37.7 | 104 |