PDiscoFormer: Relaxing Part Discovery Constraints with Vision Transformers
About
Computer vision methods that explicitly detect object parts and reason on them are a step towards inherently interpretable models. Existing approaches that perform part discovery driven by a fine-grained classification task make very restrictive assumptions on the geometric properties of the discovered parts, requiring them to be small and compact. Although this prior is useful in some cases, in this paper we show that pre-trained transformer-based vision models, such as the self-supervised DINOv2 ViT, enable the relaxation of these constraints. In particular, we find that a total variation (TV) prior, which allows for multiple connected components of any size, substantially outperforms previous work. We test our approach on three fine-grained classification benchmarks: CUB, PartImageNet and Oxford Flowers, and compare our results to previously published methods as well as a re-implementation of the state-of-the-art method PDiscoNet with a transformer-based backbone. We consistently obtain substantial improvements across the board, both on part discovery metrics and on the downstream classification task, showing that the strong inductive biases in self-supervised ViT models call for rethinking the geometric priors used for unsupervised part discovery.
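The TV prior mentioned above penalizes the spatial gradient of each part-assignment map, encouraging parts to form coherent regions without bounding their size or number of connected components. Below is a minimal sketch of such a penalty, assuming soft part-assignment maps of shape (K, H, W); the function name and NumPy formulation are illustrative, not the paper's actual implementation.

```python
import numpy as np

def total_variation(part_maps: np.ndarray) -> float:
    """Anisotropic total variation of per-part assignment maps.

    part_maps: array of shape (K, H, W) holding soft assignment
    scores for K parts over an H x W feature grid.
    Returns the mean absolute difference between vertically and
    horizontally adjacent positions, summed over both directions.
    """
    # Differences between vertical neighbors: shape (K, H-1, W)
    dv = np.abs(part_maps[:, 1:, :] - part_maps[:, :-1, :])
    # Differences between horizontal neighbors: shape (K, H, W-1)
    dh = np.abs(part_maps[:, :, 1:] - part_maps[:, :, :-1])
    return float(dv.mean() + dh.mean())

# A piecewise-constant map (one or several uniform blobs) has low TV,
# while a noisy map has high TV, so minimizing this term rewards
# spatially coherent parts of any size, unlike compactness priors.
```

In contrast to concentration or compactness losses, a constant region of any shape incurs zero penalty here, which is what permits large parts with multiple connected components.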
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | Waterbirds | Average Accuracy: 94.2 | 157 |
| Image Classification | ImageNet-1K | Accuracy: 83.3 | 43 |
| Image Classification | MetaShift | Average Accuracy: 83.2 | 33 |
| Image Classification | WaterBird (OOD) | Accuracy: 76.8 | 20 |
| Classification | ImageNet-9 Backgrounds Challenge | Accuracy (Original IN-9): 98.4 | 17 |
| Image Classification | CUB (in-distrib.) | Top-1 Accuracy: 89.1 | 10 |
| Pneumothorax detection | SIIM-ACR (test) | AUC (A): 92.6 | 9 |