# ViP-DeepLab: Learning Visual Perception with Depth-aware Video Panoptic Segmentation

## About
In this paper, we present ViP-DeepLab, a unified model attempting to tackle the long-standing and challenging inverse projection problem in vision, which we model as restoring the point clouds from perspective image sequences while providing each point with instance-level semantic interpretations. Solving this problem requires the vision models to predict the spatial location, semantic class, and temporally consistent instance label for each 3D point. ViP-DeepLab approaches it by jointly performing monocular depth estimation and video panoptic segmentation. We name this joint task Depth-aware Video Panoptic Segmentation, and propose a new evaluation metric along with two derived datasets for it. On the individual sub-tasks, ViP-DeepLab also achieves state-of-the-art results, outperforming previous methods by 5.1% VPQ on Cityscapes-VPS and ranking 1st on both the KITTI monocular depth estimation benchmark and KITTI MOTS pedestrian. The datasets and evaluation code are publicly available.
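To make the inverse projection concrete, the sketch below lifts each pixel of a predicted depth map into a 3D point via the pinhole camera model and attaches the predicted panoptic (semantic + instance) label to it. This is a minimal illustration of the task setup, not ViP-DeepLab's actual code; the intrinsics `fx, fy, cx, cy` and all names are assumptions.

```python
import numpy as np

def depth_to_labeled_points(depth, panoptic, fx, fy, cx, cy):
    """Back-projects an (H, W) depth map into an (H*W, 3) point cloud,
    pairing each 3D point with its panoptic label from an (H, W) map."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel grid (u: col, v: row)
    x = (u - cx) * depth / fx  # X = (u - cx) * Z / fx
    y = (v - cy) * depth / fy  # Y = (v - cy) * Z / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return points, panoptic.reshape(-1)
```

Running this on every frame of a sequence, with temporally consistent instance labels in `panoptic`, yields exactly the labeled point clouds the task asks for.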
## Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Panoptic Segmentation | Cityscapes-VPS (val) | VPQ | 70.4 | 110 |
| Monocular Depth Estimation | KITTI (test) | Abs Rel Error | 8.94 | 103 |
| Video Panoptic Segmentation | VIPSeg (val) | VPQ | 16 | 73 |
| Depth-aware Video Panoptic Segmentation | Cityscapes-DVPS (val) | DVPQ | 68.7 | 42 |
| Depth-aware Video Panoptic Segmentation | SemKITTI-DVPS (val) | DVPQ | 54.7 | 42 |
| Video Panoptic Segmentation | Cityscapes-VPS (test) | VPQ | 68.9 | 32 |
| Video Panoptic Segmentation | VIPSeg | VPQ | 16 | 25 |
| Depth Estimation | KITTI public benchmark official (test) | SILog | 10.8 | 22 |
| Multi-Object Tracking and Segmentation | KITTI MOTS (val) | sMOTSA (Car) | 86 | 18 |
| Video Panoptic Segmentation | VIPSeg-VPS (val) | VPQ^1 | 18.4 | 17 |
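The DVPQ numbers above come from the depth-aware metric the paper proposes, which couples the two sub-tasks: panoptic predictions only count where the depth estimate is accurate. The sketch below is an illustrative approximation of that masking step, not the official evaluation code; the threshold `lam`, the `VOID` label, and the function name are assumptions.

```python
import numpy as np

VOID = 255  # illustrative void label for ignored pixels

def mask_panoptic_by_depth_error(panoptic_pred, depth_pred, depth_gt, lam):
    """Sets predicted panoptic labels to void wherever the absolute relative
    depth error exceeds `lam`; computing VPQ on the masked predictions then
    yields a depth-aware panoptic quality in the spirit of DVPQ."""
    abs_rel = np.abs(depth_pred - depth_gt) / np.maximum(depth_gt, 1e-6)
    masked = panoptic_pred.copy()
    masked[abs_rel > lam] = VOID
    return masked
```

Tightening `lam` makes the metric demand better depth before segmentation quality is credited, which is what ties the two sub-tasks together in a single score.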