WeakTr: Exploring Plain Vision Transformer for Weakly-supervised Semantic Segmentation

About

Transformers have been very successful in various computer vision tasks, and understanding their working mechanism is important. As touchstones, weakly-supervised semantic segmentation (WSSS) and class activation map (CAM) generation are useful tasks for analyzing vision transformers (ViT). Based on the plain ViT pre-trained with ImageNet classification, we find that multi-layer, multi-head self-attention maps can provide rich and diverse information for WSSS and CAM generation, e.g., different attention heads of ViT focus on different image areas and object categories. Thus we propose a novel method to estimate the importance of attention heads end-to-end, where the self-attention maps are adaptively fused for high-quality CAM results that tend to cover objects more completely. In addition, we propose a ViT-based gradient clipping decoder for efficient and effective online retraining with the CAM results. Furthermore, the gradient clipping decoder makes good use of the knowledge in large-scale pre-trained ViT and scales well. The proposed plain Transformer-based Weakly-supervised learning method (WeakTr) obtains superior WSSS performance on standard benchmarks, i.e., 78.5% mIoU on the val set of PASCAL VOC 2012 and 51.1% mIoU on the val set of COCO 2014. Source code and checkpoints are available at https://github.com/hustvl/WeakTr.
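
The adaptive fusion idea in the abstract can be illustrated with a small sketch. This is not the WeakTr implementation: the function names (`fuse_attention_maps`, `refine_cam`), the softmax normalization of head scores, and the multiplicative refinement step are all assumptions made for illustration. In the paper the head importance is learned end-to-end, whereas here the scores are simply passed in.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of head-importance scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_attention_maps(attn_maps, head_scores):
    # attn_maps: one row per (layer, head), one column per image patch.
    # head_scores: hypothetical importance score per row (assumed given here).
    weights = softmax(head_scores)
    num_patches = len(attn_maps[0])
    # Weighted sum over heads, computed independently for each patch.
    return [
        sum(w * row[i] for w, row in zip(weights, attn_maps))
        for i in range(num_patches)
    ]

def refine_cam(coarse_cam, fused_attn):
    # coarse_cam: one row of patch scores per class.
    # Assumption: refine by scaling each class map with the fused attention,
    # then min-max normalize each class map to [0, 1].
    refined = []
    for class_scores in coarse_cam:
        scaled = [c * a for c, a in zip(class_scores, fused_attn)]
        lo, hi = min(scaled), max(scaled)
        span = max(hi - lo, 1e-8)  # guard against a constant map
        refined.append([(v - lo) / span for v in scaled])
    return refined

# Two heads attending to opposite ends of a 4-patch image, equal importance.
fused = fuse_attention_maps(
    [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]],
    [0.0, 0.0],
)
refined = refine_cam([[1.0, 1.0, 1.0, 1.0]], fused)
```

With equal scores the two heads contribute equally, so the fused map highlights both ends of the image. Raising one head's score shifts the fused map toward that head's attention pattern.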

Lianghui Zhu, Yingyue Li, Jiemin Fang, Yan Liu, Hao Xin, Wenyu Liu, Xinggang Wang • 2023

Related benchmarks

Task	Dataset	Result
Semantic segmentation	PASCAL VOC 2012 (val)	Mean IoU 78.5	2204
Semantic segmentation	PASCAL VOC 2012 (test)	mIoU 79.4	1477
Semantic segmentation	PASCAL VOC (val)	mIoU 81.4	380
Semantic segmentation	COCO 2014 (val)	mIoU 51.1	304
Semantic segmentation	Pascal VOC (test)	mIoU 78.4	268
Semantic segmentation	COCO (val)	mIoU 53.7	185
Weakly supervised semantic segmentation	PASCAL VOC 2012 (val)	mIoU 78.4	168
Semantic segmentation	PASCAL VOC 2012 (val)	mIoU 78.4	166
Weakly supervised semantic segmentation	PASCAL VOC 2012 (train)	mIoU 80.3	120
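
For readers unfamiliar with the metric reported above, mean intersection-over-union (mIoU) can be sketched as follows. This is a minimal illustration, not the official benchmark evaluation code: the function name `mean_iou` is hypothetical, labels are given as flat lists of class indices, and classes absent from both prediction and ground truth are skipped, which is an assumption (benchmark toolkits typically accumulate counts over the whole dataset and also handle an "ignore" label).

```python
def mean_iou(pred, target, num_classes):
    # pred, target: flat sequences of integer class labels, one per pixel.
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            # Correct pixel: counts toward intersection and union of its class.
            intersection[p] += 1
            union[p] += 1
        else:
            # Wrong pixel: enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    # Assumption: average only over classes that actually occur.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0

# One of four pixels is mislabeled as class 0 instead of class 1.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

The function returns a fraction in [0, 1]; the table reports the same quantity as a percentage.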
