Semantic Video CNNs through Representation Warping
About
In this work, we propose a technique to convert CNN models for semantic segmentation of static images into CNNs for video data. We describe a warping method that can be used to augment existing architectures with very little extra computational cost. This module is called NetWarp, and we demonstrate its use for a range of network architectures. The main design principle is to use the optical flow between adjacent frames to warp internal network representations across time. A key insight of this work is that fast optical flow methods can be combined with many different CNN architectures for improved performance and end-to-end training. Experiments validate that the proposed approach incurs only a small extra computational cost while improving performance when video streams are available. We achieve new state-of-the-art results on the CamVid and Cityscapes benchmark datasets and show consistent improvements over different baseline networks. Our code and models are available at http://segmentation.is.tue.mpg.de
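The core operation described above, warping a feature map from one frame toward the next along an optical-flow field, can be sketched as backward bilinear sampling. The snippet below is a minimal NumPy illustration of that idea, not the authors' implementation; the function name `warp_features` and the (channels, height, width) layout are assumptions for this example.

```python
import numpy as np

def warp_features(features, flow):
    """Warp a feature map (C, H, W) backward along an optical-flow
    field (2, H, W): each output location (y, x) samples the input
    at (y + v, x + u) with bilinear interpolation.
    Illustrative sketch only, not the NetWarp reference code."""
    C, H, W = features.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Displace the sampling grid by the flow (u = horizontal, v = vertical),
    # clamping coordinates to the image border.
    x = np.clip(xs + flow[0], 0, W - 1)
    y = np.clip(ys + flow[1], 0, H - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    # Bilinearly blend the four neighbouring feature vectors per channel.
    top = features[:, y0, x0] * (1 - wx) + features[:, y0, x1] * wx
    bot = features[:, y1, x0] * (1 - wx) + features[:, y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

Because the sampling is differentiable in the feature values, a module like this can sit inside a segmentation network and be trained end-to-end, which is what makes combining it with existing architectures cheap.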
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic segmentation | Cityscapes (test) | mIoU | 80.5 | 1145 |
| Semantic segmentation | CamVid (test) | mIoU | 67.1 | 411 |
| Video Semantic Segmentation | Cityscapes (val) | mIoU | 80.6 | 91 |
| Video Semantic Segmentation | VSPW (test) | mIoU | 37.5 | 25 |
| Video Semantic Segmentation | CamVid | mIoU | 67.1 | 14 |
| Semantic segmentation | RuralScapes 12 semantic classes (val) | mIoU | 63.99 | 12 |
| Semantic segmentation | UAVid 8 semantic classes (val) | mIoU | 79.31 | 12 |
| Video Semantic Segmentation | CamVid (val) | mIoU | 67.1 | 4 |