WeakTr: Exploring Plain Vision Transformer for Weakly-supervised Semantic Segmentation
About
Transformers have been very successful in various computer vision tasks, and understanding how they work is important. Weakly-supervised semantic segmentation (WSSS) and class activation map (CAM) generation serve as touchstones for analyzing vision transformers (ViT). Starting from a plain ViT pre-trained on ImageNet classification, we find that multi-layer, multi-head self-attention maps provide rich and diverse information for WSSS and CAM generation; for example, different attention heads of ViT focus on different image areas and object categories. We therefore propose a novel method that estimates the importance of attention heads end-to-end and adaptively fuses the self-attention maps into high-quality CAM results that tend to cover objects more completely. We also propose a ViT-based gradient clipping decoder for efficient and effective online retraining with the CAM results. The gradient clipping decoder makes good use of the knowledge in large-scale pre-trained ViT and scales well. The proposed plain Transformer-based weakly-supervised learning method (WeakTr) obtains superior WSSS performance on standard benchmarks: 78.5% mIoU on the val set of PASCAL VOC 2012 and 51.1% mIoU on the val set of COCO 2014. Source code and checkpoints are available at https://github.com/hustvl/WeakTr.
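To make the idea of adaptively fusing attention heads concrete, here is a minimal sketch of importance-weighted fusion in plain Python. It is an illustration under stated assumptions, not the WeakTr implementation: the function names `softmax` and `fuse_attention_maps` are hypothetical, the importance scores are passed in by hand rather than learned end-to-end, and normalizing them with a softmax is an assumption that the abstract does not confirm. A real pipeline would operate on tensors from every layer and head of the ViT.

```python
import math


def softmax(logits):
    """Turn raw importance scores into weights that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_attention_maps(attn_maps, head_logits):
    """Fuse per-head attention maps into one map.

    attn_maps:   one flattened attention map per head, each with one
                 score per image patch (hypothetical input layout).
    head_logits: one importance score per head; assumed here to come
                 from some learned estimator that is not shown.
    """
    if len(attn_maps) != len(head_logits):
        raise ValueError("need exactly one importance score per head")
    weights = softmax(head_logits)
    fused = [0.0] * len(attn_maps[0])
    for weight, head_map in zip(weights, attn_maps):
        for i, score in enumerate(head_map):
            fused[i] += weight * score
    return fused


# Two toy heads that attend to different halves of a 4-patch image.
heads = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
print(fuse_attention_maps(heads, [0.0, 0.0]))  # equal importance
print(fuse_attention_maps(heads, [4.0, 0.0]))  # first head dominates
```

With equal scores the fused map is the plain average of the heads; raising one head's score shifts the fused map toward the regions that head attends to, which is the behavior the learned importance estimate is meant to exploit.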
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Semantic segmentation | PASCAL VOC 2012 (val) | mIoU 78.5 | 2142 |
| Semantic segmentation | PASCAL VOC 2012 (test) | mIoU 79.4 | 1415 |
| Semantic segmentation | PASCAL VOC (val) | mIoU 81.4 | 362 |
| Semantic segmentation | COCO 2014 (val) | mIoU 51.1 | 304 |
| Semantic segmentation | Pascal VOC (test) | mIoU 78.4 | 236 |
| Weakly supervised semantic segmentation | PASCAL VOC 2012 (val) | mIoU 78.4 | 168 |
| Semantic segmentation | COCO (val) | mIoU 53.7 | 150 |
| Weakly supervised semantic segmentation | PASCAL VOC 2012 (train) | -- | 53 |
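
The results above are reported as mean intersection-over-union (mIoU). As a reminder of what that number measures, the following is a minimal sketch in plain Python over flattened label lists. The function name `mean_iou` is hypothetical, and the default `ignore_index=255` is an assumption based on the common PASCAL VOC convention for void pixels. Benchmark evaluations accumulate intersections and unions over the whole dataset before averaging across classes; the sketch does the same for whatever pixels it is given.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over the classes that occur in pred or target.

    pred, target: flattened sequences of integer class labels.
    Predictions are assumed to lie in range(num_classes).
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # skip pixels without a valid ground-truth label
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            # A mismatch enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with two classes and four pixels.
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

Multiplying the returned fraction by 100 gives a percentage on the same scale as the table.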