
WeakTr: Exploring Plain Vision Transformer for Weakly-supervised Semantic Segmentation

About

Transformers have been very successful in various computer vision tasks, and understanding how they work is important. Weakly-supervised semantic segmentation (WSSS) and class activation map (CAM) generation serve as touchstone tasks for analyzing vision transformers (ViT). Starting from a plain ViT pre-trained on ImageNet classification, we find that multi-layer, multi-head self-attention maps provide rich and diverse information for WSSS and CAM generation; for example, different attention heads of ViT focus on different image areas and object categories. We therefore propose a novel method that estimates the importance of attention heads end-to-end and adaptively fuses the self-attention maps into high-quality CAM results that tend to cover objects more completely. In addition, we propose a ViT-based gradient clipping decoder for efficient and effective online retraining with the CAM results. The gradient clipping decoder also makes good use of the knowledge in large-scale pre-trained ViT and scales well. The proposed plain Transformer-based weakly-supervised learning method (WeakTr) obtains superior WSSS performance on standard benchmarks: 78.5% mIoU on the val set of PASCAL VOC 2012 and 51.1% mIoU on the val set of COCO 2014. Source code and checkpoints are available at https://github.com/hustvl/WeakTr.
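As a rough illustration of the adaptive fusion idea in the abstract above, the sketch below takes a stack of self-attention maps, combines them with per-head importance weights, and uses the fused map to spread coarse CAM scores across patch tokens. This is a minimal NumPy sketch under stated assumptions, not the WeakTr implementation: the function names `fuse_attention_maps` and `refine_cam` are hypothetical, and the weights are passed in by hand, whereas the paper estimates them end-to-end.

```python
import numpy as np


def fuse_attention_maps(attn, head_weights):
    """Weighted average of per-head self-attention maps.

    attn:         (K, N, N) array; K = layers * heads, N = patch tokens.
    head_weights: (K,) non-negative importance scores. These are a
                  hypothetical input here; the paper learns them end-to-end.
    Returns an (N, N) fused attention map.
    """
    w = np.asarray(head_weights, dtype=np.float64)
    w = w / w.sum()  # normalize so the weights sum to 1
    # Contract the head axis: sum_k w[k] * attn[k]
    return np.tensordot(w, np.asarray(attn, dtype=np.float64), axes=1)


def refine_cam(coarse_cam, fused_attn):
    """Propagate coarse CAM scores along the fused attention.

    coarse_cam: (C, N) per-class scores for each patch token.
    fused_attn: (N, N), where row i holds how much token i attends to each token j.
    Returns (C, N): token i receives sum_j fused_attn[i, j] * coarse_cam[c, j].
    """
    return np.asarray(coarse_cam, dtype=np.float64) @ fused_attn.T


# Toy example: 2 "heads", 4 patch tokens, 1 class.
n = 4
identity_head = np.eye(n)                  # each token attends only to itself
uniform_head = np.full((n, n), 1.0 / n)    # each token attends to all tokens equally
attn = np.stack([identity_head, uniform_head])
coarse = np.array([[1.0, 0.0, 0.0, 0.0]])  # only the first token is activated

fused = fuse_attention_maps(attn, [1.0, 1.0])
refined = refine_cam(coarse, fused)
```

With equal weights, the uniform head spreads part of the first token's activation to every other token, which is the intuition behind CAMs that "cover objects more completely"; giving a head zero weight removes its influence entirely.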

Lianghui Zhu, Yingyue Li, Jiemin Fang, Yan Liu, Hao Xin, Wenyu Liu, Xinggang Wang • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Semantic segmentation | PASCAL VOC 2012 (val) | Mean IoU 78.5 | 2142 |
| Semantic segmentation | PASCAL VOC 2012 (test) | mIoU 79.4 | 1415 |
| Semantic segmentation | PASCAL VOC (val) | mIoU 81.4 | 362 |
| Semantic segmentation | COCO 2014 (val) | mIoU 51.1 | 304 |
| Semantic segmentation | Pascal VOC (test) | mIoU 78.4 | 236 |
| Weakly supervised semantic segmentation | PASCAL VOC 2012 (val) | mIoU 78.4 | 168 |
| Semantic segmentation | COCO (val) | mIoU 53.7 | 150 |
| Weakly supervised semantic segmentation | PASCAL VOC 2012 (train) | -- | 53 |
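
For readers unfamiliar with the metric reported above, mean IoU (mIoU) averages the per-class intersection-over-union between predicted and ground-truth label maps. The following is a minimal sketch of that computation, not the evaluation script behind these numbers: the function name `mean_iou` is hypothetical, the default `ignore_index=255` assumes the PASCAL VOC convention for void pixels, and benchmark scores are normally accumulated over the whole dataset rather than computed per image.

```python
import numpy as np


def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean intersection-over-union from integer label arrays.

    pred, target: arrays of the same shape with class ids in [0, num_classes).
    ignore_index: target label to skip (assumed 255, the PASCAL VOC void label).
    Classes absent from both pred and target are left out of the mean.
    """
    pred = np.asarray(pred, dtype=np.int64).ravel()
    target = np.asarray(target, dtype=np.int64).ravel()
    keep = target != ignore_index
    pred, target = pred[keep], target[keep]

    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)

    inter = np.diag(conf)                          # correctly labeled pixels per class
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    valid = union > 0                              # skip classes that never appear
    return float((inter[valid] / union[valid]).mean())


# Toy example with two classes and one void pixel.
pred = np.array([0, 0, 1, 1])
target = np.array([0, 1, 1, 255])
score = mean_iou(pred, target, num_classes=2)
```

In this toy case the void pixel is dropped, each class has one correct pixel out of a union of two, and the score is 0.5.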
