Semantic Video CNNs through Representation Warping
About
In this work, we propose a technique to convert CNN models for semantic segmentation of static images into CNNs for video data. We describe a warping method that can be used to augment existing architectures with very little extra computational cost. This module is called NetWarp, and we demonstrate its use for a range of network architectures. The main design principle is to use the optical flow between adjacent frames to warp internal network representations across time. A key insight of this work is that fast optical flow methods can be combined with many different CNN architectures for improved performance and end-to-end training. Experiments validate that the proposed approach incurs only a small extra computational cost while improving performance when video streams are available. We achieve new state-of-the-art results on the CamVid and Cityscapes benchmark datasets and show consistent improvements over different baseline networks. Our code and models are available at http://segmentation.is.tue.mpg.de
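The core operation described above, warping a feature map from one frame toward the next along an optical-flow field, can be sketched as backward bilinear sampling. The snippet below is a minimal NumPy illustration of that idea, not the authors' implementation; the function name `warp_features` and the (channels, height, width) layout are assumptions for this example.

```python
import numpy as np

def warp_features(features, flow):
    """Warp a feature map (C, H, W) backward along an optical-flow
    field (2, H, W): each output location (y, x) samples the input
    at (y + v, x + u) with bilinear interpolation.
    Illustrative sketch only, not the NetWarp reference code."""
    C, H, W = features.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Displace the sampling grid by the flow (u = horizontal, v = vertical),
    # clamping coordinates to the image border.
    x = np.clip(xs + flow[0], 0, W - 1)
    y = np.clip(ys + flow[1], 0, H - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    # Bilinearly blend the four neighbouring feature vectors per channel.
    top = features[:, y0, x0] * (1 - wx) + features[:, y0, x1] * wx
    bot = features[:, y1, x0] * (1 - wx) + features[:, y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

Because the sampling is differentiable in the feature values, a module like this can sit inside a segmentation network and be trained end-to-end, which is what makes combining it with existing architectures cheap.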
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic segmentation | Cityscapes (test) | mIoU | 80.5 | 1145 |
| Semantic segmentation | CamVid (test) | mIoU | 67.1 | 411 |
| Video Semantic Segmentation | Cityscapes (val) | mIoU | 80.6 | 91 |
| Video Semantic Segmentation | VSPW (test) | mIoU | 37.5 | 25 |
| Video Semantic Segmentation | CamVid | mIoU | 67.1 | 14 |
| Semantic segmentation | RuralScapes 12 semantic classes (val) | mIoU | 63.99 | 12 |
| Semantic segmentation | UAVid 8 semantic classes (val) | mIoU | 79.31 | 12 |
| Video Semantic Segmentation | CamVid (val) | mIoU | 67.1 | 4 |