Distilling Self-Supervised Vision Transformers for Weakly-Supervised Few-Shot Classification & Segmentation
About
We address the task of weakly-supervised few-shot image classification and segmentation by leveraging a Vision Transformer (ViT) pretrained with self-supervision. Our proposed method takes token representations from the self-supervised ViT and leverages their correlations, via self-attention, to produce classification and segmentation predictions through separate task heads. Our model learns to perform classification and segmentation without pixel-level labels during training, using only image-level labels. To do this, it uses attention maps, created from tokens generated by the self-supervised ViT backbone, as pixel-level pseudo-labels. We also explore a practical setup with "mixed" supervision, where a small number of training images have ground-truth pixel-level labels and the remaining images have only image-level labels. For this mixed setup, we propose to improve the pseudo-labels using a pseudo-label enhancer trained on the available ground-truth pixel-level labels. Experiments on Pascal-5i and COCO-20i demonstrate significant performance gains in a variety of supervision settings, in particular when few or no pixel-level labels are available.
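To make the pseudo-labelling idea concrete, here is a minimal sketch of how a coarse mask might be derived from ViT patch tokens. It is not the paper's procedure: the abstract only states that attention maps built from backbone tokens serve as pixel-level pseudo-labels. The sketch assumes cosine similarity between each patch token and a single reference token (for example a class token), min-max normalisation, and a fixed threshold. The function names `cosine` and `pseudo_label_from_tokens`, the `threshold` parameter, and the toy inputs are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # epsilon avoids division by zero


def pseudo_label_from_tokens(patch_tokens, reference_token, grid_hw, threshold=0.5):
    """Turn patch-token similarities into a coarse binary pseudo-mask.

    Assumption (not from the paper): similarity to one reference token
    stands in for the attention map, and a fixed threshold binarises it.
    """
    h, w = grid_hw
    if len(patch_tokens) != h * w:
        raise ValueError("number of patch tokens must equal h * w")

    # One similarity score per patch token.
    sims = [cosine(t, reference_token) for t in patch_tokens]

    # Min-max normalise the scores to roughly [0, 1].
    lo, hi = min(sims), max(sims)
    scores = [(s - lo) / (hi - lo + 1e-8) for s in sims]

    # Reshape the flat score list into the h x w patch grid.
    attn_map = [scores[r * w:(r + 1) * w] for r in range(h)]

    # Patches scoring at or above the threshold become foreground (1).
    mask = [[1 if s >= threshold else 0 for s in row] for row in attn_map]
    return attn_map, mask


# Toy 2x2 patch grid: the top row aligns with the reference token.
tokens = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [-1.0, 0.0]]
attn_map, mask = pseudo_label_from_tokens(tokens, [1.0, 0.0], (2, 2))
```

In a real pipeline the patch-level mask would still need upsampling to image resolution, and the mixed-supervision setting described above would refine it further with the learned enhancer; neither step is shown here.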
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic segmentation | COCO-20i | mIoU (Mean) | 19.6 | 132 |
| Few-shot Segmentation | Pascal-5^i 1-way 1-shot | -- | -- | 71 |
| Few-shot Segmentation | COCO-20 | mIoU | 48.7 | 22 |
| Few-shot Segmentation | Pascal-5^i 2-way 1-shot | Score (S=0) | 35.7 | 9 |
| Few-shot classification | Pascal-5^i 2-way 1-shot | Accuracy (S^0) | 74.3 | 8 |
| Few-shot classification | Pascal-5^i 1-way 1-shot | Accuracy (S^0) | 84 | 8 |
| Classification | Pascal-5^i 1-shot (test) | 1-way Acc | 85.7 | 5 |
| Few-Shot Classification and Segmentation | Pascal-5i 1-way 1-shot | Classification 0/1 Exact Ratio (S0) | 86.9 | 5 |
| Few-Shot Classification and Segmentation | Pascal-5i 2-way 1-shot | Classification 0/1 Ratio (S0) | 70.3 | 5 |
| Few-shot Segmentation | COCO-20i 1-way 1-shot | mIoU | 38.3 | 5 |