TubeFormer-DeepLab: Video Mask Transformer

About

We present TubeFormer-DeepLab, the first attempt to tackle multiple core video segmentation tasks in a unified manner. Different video segmentation tasks (e.g., video semantic/instance/panoptic segmentation) are usually considered as distinct problems. State-of-the-art models adopted in the separate communities have diverged, and radically different approaches dominate in each task. By contrast, we make a crucial observation that video segmentation tasks could be generally formulated as the problem of assigning different predicted labels to video tubes (where a tube is obtained by linking segmentation masks along the time axis) and the labels may encode different values depending on the target task. The observation motivates us to develop TubeFormer-DeepLab, a simple and effective video mask transformer model that is widely applicable to multiple video segmentation tasks. TubeFormer-DeepLab directly predicts video tubes with task-specific labels (either pure semantic categories, or both semantic categories and instance identities), which not only significantly simplifies video segmentation models, but also advances state-of-the-art results on multiple video segmentation benchmarks.
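The tube formulation in the abstract can be illustrated with a small sketch. This is not the TubeFormer-DeepLab model or its code; it only shows the output format the abstract describes, under stated assumptions: the names `Tube`, `tube_label`, and `render` are hypothetical, masks are plain nested lists of 0/1 values, and the task names `"semantic"` and `"panoptic"` are chosen here for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# One frame's binary mask: H rows of W values (0 or 1). Hypothetical layout.
Mask = List[List[int]]


@dataclass
class Tube:
    """A video tube: per-frame masks linked along the time axis, plus a label."""
    masks: List[Mask]                  # T masks, one per frame
    semantic_class: str                # e.g. "car"
    instance_id: Optional[int] = None  # unused for pure semantic tasks


def tube_label(tube: Tube, task: str) -> Tuple:
    """Return the task-specific label that a tube carries."""
    if task == "semantic":
        # Pure semantic category only.
        return (tube.semantic_class,)
    # Instance/panoptic style: semantic category plus instance identity.
    return (tube.semantic_class, tube.instance_id)


def render(tubes: List[Tube], task: str) -> List[List[List[Optional[Tuple]]]]:
    """Paint each tube's label onto a T x H x W grid (assumes a non-empty list)."""
    num_frames = len(tubes[0].masks)
    height = len(tubes[0].masks[0])
    width = len(tubes[0].masks[0][0])
    video = [[[None] * width for _ in range(height)] for _ in range(num_frames)]
    for tube in tubes:
        label = tube_label(tube, task)
        for t, mask in enumerate(tube.masks):
            for y, row in enumerate(mask):
                for x, on in enumerate(row):
                    if on:
                        video[t][y][x] = label
    return video


# Two cars in a 2-frame, 1x3 video; the first car moves one pixel to the right.
tubes = [
    Tube(masks=[[[1, 0, 0]], [[0, 1, 0]]], semantic_class="car", instance_id=1),
    Tube(masks=[[[0, 0, 1]], [[0, 0, 1]]], semantic_class="car", instance_id=2),
]
semantic_video = render(tubes, "semantic")
panoptic_video = render(tubes, "panoptic")
```

In this toy setup the same two tubes yield identical labels for both cars under the semantic task and distinct labels under the panoptic-style task, which is the sense in which only the label encoding changes between tasks.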

Dahun Kim, Jun Xie, Huiyu Wang, Siyuan Qiao, Qihang Yu, Hong-Seok Kim, Hartwig Adam, In So Kweon, Liang-Chieh Chen • 2022

Related benchmarks

Task	Dataset	Result
Video Instance Segmentation	YouTube-VIS 2019 (val)	AP 47.5	604
Video Instance Segmentation	YouTube-VIS 2021 (val)	AP 41.2	356
Video Semantic Segmentation	VSPW (val)	mIoU 63.2	121
Video Panoptic Segmentation	VIPSeg (val)	VPQ 31.2	83
Video Instance Segmentation	YouTube-VIS 2019	AP 47.5	75
Video Instance Segmentation	YouTube-VIS 2021	AP 41.2	66
Video Semantic Segmentation	VSPW	mIoU 63.2	55
Video Panoptic Segmentation	KITTI-STEP (val)	STQ 70	22
Video Panoptic Segmentation	KITTI-STEP (test)	STQ 65.25	15
Video Panoptic Segmentation	VIPSeg (test)	STQ 38.6	15

Showing 10 of 13 rows
