PDiscoFormer: Relaxing Part Discovery Constraints with Vision Transformers
About
Computer vision methods that explicitly detect object parts and reason on them are a step towards inherently interpretable models. Existing approaches that perform part discovery driven by a fine-grained classification task make very restrictive assumptions on the geometric properties of the discovered parts, requiring them to be small and compact. Although this prior is useful in some cases, in this paper we show that pre-trained transformer-based vision models, such as the self-supervised DINOv2 ViT, enable the relaxation of these constraints. In particular, we find that a total variation (TV) prior, which allows for multiple connected components of any size, substantially outperforms previous work. We test our approach on three fine-grained classification benchmarks: CUB, PartImageNet and Oxford Flowers, and compare our results to previously published methods as well as a re-implementation of the state-of-the-art method PDiscoNet with a transformer-based backbone. We consistently obtain substantial improvements across the board, both on part discovery metrics and on the downstream classification task, showing that the strong inductive biases in self-supervised ViT models call for rethinking the geometric priors used for unsupervised part discovery.
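The TV prior mentioned above penalizes the spatial gradient of each part-assignment map, encouraging parts to form coherent regions without bounding their size or number of connected components. Below is a minimal sketch of such a penalty, assuming soft part-assignment maps of shape (K, H, W); the function name and NumPy formulation are illustrative, not the paper's actual implementation.

```python
import numpy as np

def total_variation(part_maps: np.ndarray) -> float:
    """Anisotropic total variation of per-part assignment maps.

    part_maps: array of shape (K, H, W) holding soft assignment
    scores for K parts over an H x W feature grid.
    Returns the mean absolute difference between vertically and
    horizontally adjacent positions, summed over both directions.
    """
    # Differences between vertical neighbors: shape (K, H-1, W)
    dv = np.abs(part_maps[:, 1:, :] - part_maps[:, :-1, :])
    # Differences between horizontal neighbors: shape (K, H, W-1)
    dh = np.abs(part_maps[:, :, 1:] - part_maps[:, :, :-1])
    return float(dv.mean() + dh.mean())

# A piecewise-constant map (one or several uniform blobs) has low TV,
# while a noisy map has high TV, so minimizing this term rewards
# spatially coherent parts of any size, unlike compactness priors.
```

In contrast to concentration or compactness losses, a constant region of any shape incurs zero penalty here, which is what permits large parts with multiple connected components.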
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | Waterbirds | Average Accuracy: 94.2 | 157 |
| Image Classification | ImageNet-1K | Accuracy: 83.3 | 43 |
| Image Classification | MetaShift | Average Accuracy: 83.2 | 33 |
| Image Classification | WaterBird (OOD) | Accuracy: 76.8 | 20 |
| Classification | ImageNet-9 Backgrounds Challenge | Accuracy (Original IN-9): 98.4 | 17 |
| Image Classification | CUB (in-distrib.) | Top-1 Accuracy: 89.1 | 10 |
| Pneumothorax detection | SIIM-ACR (test) | AUC (A): 92.6 | 9 |