Temporally Efficient Vision Transformer for Video Instance Segmentation

About

Recently, vision transformers have achieved tremendous success on image-level visual recognition tasks. To effectively and efficiently model the crucial temporal information within a video clip, we propose a Temporally Efficient Vision Transformer (TeViT) for video instance segmentation (VIS). Unlike previous transformer-based VIS methods, TeViT is nearly convolution-free: it consists of a transformer backbone and a query-based video instance segmentation head. In the backbone stage, we propose a nearly parameter-free messenger shift mechanism for early temporal context fusion. In the head stages, we propose a parameter-shared spatiotemporal query interaction mechanism to build a one-to-one correspondence between video instances and queries. Thus, TeViT fully utilizes both frame-level and instance-level temporal context information and obtains strong temporal modeling capacity with negligible extra computational cost. On three widely adopted VIS benchmarks, i.e., YouTube-VIS-2019, YouTube-VIS-2021, and OVIS, TeViT obtains state-of-the-art results and maintains high inference speed, e.g., 46.6 AP at 68.9 FPS on YouTube-VIS-2019. Code is available at https://github.com/hustvl/TeViT.

Shusheng Yang, Xinggang Wang, Yu Li, Yuxin Fang, Jiemin Fang, Wenyu Liu, Xun Zhao, Ying Shan • 2022
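
The abstract names a "nearly parameter-free messenger shift" but does not spell out its mechanics. As a rough illustration only, the sketch below assumes each frame carries a small set of extra "messenger" tokens and that temporal fusion happens by cyclically passing half of them to the next frame and half to the previous one. The function name `messenger_shift`, the half/half split, and the cyclic wrap-around are all assumptions made for this example, not details confirmed by the text above.

```python
from typing import List

Token = List[float]   # one messenger token (a feature vector)
Frame = List[Token]   # all messenger tokens attached to one frame


def messenger_shift(frames: List[Frame]) -> List[Frame]:
    """Cyclically shift messenger tokens along the time axis.

    Hypothetical scheme: the first half of each frame's messenger tokens
    is passed to the next frame, the second half to the previous frame.
    No learned parameters are involved, only re-indexing.
    """
    num_frames = len(frames)
    if num_frames == 0:
        return []
    half = len(frames[0]) // 2
    shifted = []
    for t in range(num_frames):
        prev_frame = frames[(t - 1) % num_frames]
        next_frame = frames[(t + 1) % num_frames]
        # Tokens [0, half) arrive from the previous frame (forward shift);
        # the remaining tokens arrive from the next frame (backward shift).
        shifted.append(prev_frame[:half] + next_frame[half:])
    return shifted


# Three frames, two one-dimensional messenger tokens per frame.
clip = [[[0.0], [1.0]], [[10.0], [11.0]], [[20.0], [21.0]]]
mixed = messenger_shift(clip)
```

After the call, every frame holds messenger tokens that originated in its temporal neighbours, so a following per-frame attention block would see cross-frame context without any extra weights. That is the sense in which such a shift could be called nearly parameter-free.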

Related benchmarks

Task	Dataset	Result
Video Instance Segmentation	YouTube-VIS 2019 (val)	AP 46.6	604
Video Instance Segmentation	YouTube-VIS 2021 (val)	AP 37.9	356
Video Instance Segmentation	OVIS (val)	AP 17.4	301
Video Instance Segmentation	YouTube-VIS 2019 (test)	AP 56.8	13
Video Instance Segmentation	OVIS (test)	AP 17.4	12
Audio-Visual Sound Segmentation	AVISeg (test)	FSLA 32.28	12
Video Instance Segmentation	OVIS 1.0 (val)	AP 17.4	11
Video Instance Segmentation	AVISeg (test)	FSLA 32.28	7

