SegViT: Semantic Segmentation with Plain Vision Transformers

About

We explore the capability of plain Vision Transformers (ViTs) for semantic segmentation and propose the SegVit. Previous ViT-based segmentation networks usually learn a pixel-level representation from the output of the ViT. Differently, we make use of the fundamental component -- attention mechanism, to generate masks for semantic segmentation. Specifically, we propose the Attention-to-Mask (ATM) module, in which the similarity maps between a set of learnable class tokens and the spatial feature maps are transferred to the segmentation masks. Experiments show that our proposed SegVit using the ATM module outperforms its counterparts using the plain ViT backbone on the ADE20K dataset and achieves new state-of-the-art performance on COCO-Stuff-10K and PASCAL-Context datasets. Furthermore, to reduce the computational cost of the ViT backbone, we propose query-based down-sampling (QD) and query-based up-sampling (QU) to build a Shrunk structure. With the proposed Shrunk structure, the model can save up to $40\%$ computations while maintaining competitive performance.

Bowen Zhang, Zhi Tian, Quan Tang, Xiangxiang Chu, Xiaolin Wei, Chunhua Shen, Yifan Liu• 2022

Related benchmarks

Task	Dataset	Result
Semantic segmentation	ADE20K (val)	--	3069
Semantic segmentation	ADE20K	mIoU58	1028
Semantic segmentation	PASCAL Context (val)	--	360
Semantic segmentation	Pascal Context (test)	--	223
Semantic segmentation	Pascal Context	mIoU66.61	217
Semantic segmentation	Pascal Context 59	--	79
Semantic segmentation	COCO-Stuff-10K (test)	mIoU50.3	47
Semantic segmentation	ISPRS Vaihingen (test)	F1 Score88.2	47
Semantic segmentation	COCOStuff 164K	--	39
Semantic segmentation	ISPRS Potsdam	mIoU86.85	32

Showing 10 of 16 rows

Other info

Code

Follow for update

@wizwand_team Discord