Differentiable Patch Selection for Image Recognition
About
Neural networks require large amounts of memory and compute to process high-resolution images, even when only a small part of the image is actually informative for the task at hand. We propose a method based on a differentiable Top-K operator that selects the most relevant parts of the input, so that high-resolution images can be processed efficiently. Our method can be interfaced with any downstream neural network, is able to aggregate information from different patches in a flexible way, and allows the whole model to be trained end-to-end using backpropagation. We show results for traffic sign recognition, inter-patch relationship reasoning, and fine-grained recognition without using object/part bounding box annotations during training.
Jean-Baptiste Cordonnier, Aravindh Mahendran, Alexey Dosovitskiy, Dirk Weissenborn, Jakob Uszkoreit, Thomas Unterthiner • 2021
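To make the idea of a relaxed Top-K patch selection concrete, here is a minimal NumPy sketch of the forward pass only. It assumes one common way of smoothing Top-K, namely averaging hard selections over Gaussian-perturbed scores; the abstract does not state which relaxation is used, so treat this as an illustration rather than the authors' implementation. The names `perturbed_topk_indicators`, `select_patches`, `sigma`, and `num_samples` are hypothetical, and gradient estimation (which an autodiff framework would need for end-to-end training) is omitted.

```python
import numpy as np


def perturbed_topk_indicators(scores, k, num_samples=100, sigma=0.05, seed=0):
    """Monte Carlo estimate of smoothed top-k indicators (forward pass only).

    scores: array of shape (n,), one relevance score per patch.
    Returns an array of shape (k, n). Row i is a distribution over which
    patch fills the i-th selected slot (slots are ordered by patch index).
    """
    rng = np.random.default_rng(seed)
    n = scores.shape[0]
    # Assumption: Gaussian noise with scale sigma; other choices are possible.
    noise = rng.normal(size=(num_samples, n))
    perturbed = scores[None, :] + sigma * noise            # (num_samples, n)
    # Indices of the k largest perturbed scores in each noise sample.
    topk = np.argpartition(-perturbed, k - 1, axis=1)[:, :k]
    topk = np.sort(topk, axis=1)                           # (num_samples, k)
    one_hot = np.eye(n)[topk]                              # (num_samples, k, n)
    # Averaging hard one-hot selections yields soft selection weights.
    return one_hot.mean(axis=0)                            # (k, n)


def select_patches(patches, indicators):
    """Combine patches using the soft indicators.

    patches: (n, h, w, c); indicators: (k, n). Returns (k, h, w, c).
    """
    return np.tensordot(indicators, patches, axes=([1], [0]))
```

With well-separated scores and a small `sigma`, the indicators collapse to hard one-hot rows and `select_patches` returns the top-scoring patches unchanged; with a larger `sigma`, each output slot becomes a weighted blend of nearby candidates. The selected patches can then be passed to any downstream network.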
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Time-series classification | SelfRegulationSCP2 | Accuracy | 55.1 | 55 |
| Time-series classification | Heartbeat | Accuracy | 70.5 | 51 |
| Time-series classification | SelfRegulationSCP1 | Accuracy | 87.2 | 45 |
| Multivariate Time Series Classification | Finger Movement | Accuracy | 58 | 39 |
| Time-series classification | FaceDetection | Accuracy | 65.4 | 34 |
| Multivariate Time Series Classification | MotorImagery | Accuracy | 53 | 28 |
| Fine-grained visual classification | CUB-200 | Accuracy | 86.7 | 24 |
| Traffic Sign Recognition | Swedish traffic signs dataset Subset setup (test) | Accuracy | 91.7 | 7 |
| Binary Classification | Traffic Signs Recognition (test) | Accuracy | 91.7 | 6 |
| Time-series classification | WalkingSittingStanding | Accuracy | 0.897 | 6 |