TubeFormer-DeepLab: Video Mask Transformer
About
We present TubeFormer-DeepLab, the first attempt to tackle multiple core video segmentation tasks in a unified manner. Different video segmentation tasks (e.g., video semantic/instance/panoptic segmentation) are usually treated as distinct problems. The state-of-the-art models adopted in the separate communities have diverged, and radically different approaches dominate each task. By contrast, we make the crucial observation that video segmentation tasks can generally be formulated as the problem of assigning predicted labels to video tubes, where a tube is obtained by linking segmentation masks along the time axis and the labels encode different values depending on the target task. This observation motivates us to develop TubeFormer-DeepLab, a simple and effective video mask transformer model that is widely applicable to multiple video segmentation tasks. TubeFormer-DeepLab directly predicts video tubes with task-specific labels (either pure semantic categories, or both semantic categories and instance identities), which not only significantly simplifies video segmentation models but also advances state-of-the-art results on multiple video segmentation benchmarks.
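To make the tube formulation concrete, the sketch below shows one possible way to represent labeled video tubes and paint them into per-pixel label maps. It illustrates only the output representation described above, not the TubeFormer-DeepLab architecture. The names `Tube`, `render`, and `VOID`, and the rule that later tubes overwrite earlier ones where they overlap, are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Mask = List[List[int]]        # H x W binary mask for a single frame
LabelMap = List[List[List[int]]]  # T x H x W per-pixel labels
VOID = -1                     # hypothetical label for unassigned pixels


@dataclass
class Tube:
    """A video tube: one binary mask per frame, linked along the time axis."""
    masks: List[Mask]                  # T x H x W
    semantic_class: int                # always present
    instance_id: Optional[int] = None  # None for pure semantic segmentation


def render(tubes: List[Tube], num_frames: int, height: int,
           width: int) -> Tuple[LabelMap, LabelMap]:
    """Paint tubes into per-pixel semantic and instance label maps.

    Assumption: where tubes overlap, the later tube in the list wins.
    """
    sem = [[[VOID] * width for _ in range(height)] for _ in range(num_frames)]
    inst = [[[VOID] * width for _ in range(height)] for _ in range(num_frames)]
    for tube in tubes:
        for t in range(num_frames):
            for y in range(height):
                for x in range(width):
                    if tube.masks[t][y][x]:
                        sem[t][y][x] = tube.semantic_class
                        if tube.instance_id is not None:
                            inst[t][y][x] = tube.instance_id
                        else:
                            inst[t][y][x] = VOID
    return sem, inst


# Toy clip: 2 frames of 1 x 3 pixels.
# Class 0 stands for a "stuff" region (semantic label only),
# class 1 for a "thing" that also carries an instance identity.
road = Tube(masks=[[[1, 0, 0]], [[1, 1, 0]]], semantic_class=0)
car = Tube(masks=[[[0, 1, 1]], [[0, 0, 1]]], semantic_class=1, instance_id=7)

sem, inst = render([road, car], num_frames=2, height=1, width=3)
print(sem)   # per-pixel semantic classes for each frame
print(inst)  # per-pixel instance ids (VOID where no instance applies)
```

Under this representation, video semantic segmentation reads only `sem`, while video instance and panoptic segmentation also use `inst`, so the task is chosen by which labels each tube carries rather than by a different output structure.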
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Instance Segmentation | YouTube-VIS 2019 (val) | AP | 47.5 | 567 |
| Video Instance Segmentation | YouTube-VIS 2021 (val) | AP | 41.2 | 344 |
| Video Semantic Segmentation | VSPW (val) | mIoU | 63.2 | 92 |
| Video Instance Segmentation | YouTube-VIS 2019 | AP | 47.5 | 75 |
| Video Panoptic Segmentation | VIPSeg (val) | VPQ | 31.2 | 73 |
| Video Instance Segmentation | YouTube-VIS 2021 | AP | 41.2 | 63 |
| Video Semantic Segmentation | VSPW | mIoU | 63.2 | 25 |
| Video Panoptic Segmentation | KITTI-STEP (val) | STQ | 70 | 22 |
| Video Panoptic Segmentation | KITTI-STEP (test) | STQ | 65.25 | 15 |
| Video Panoptic Segmentation | VIPSeg (test) | STQ | 38.6 | 15 |