Temporally Distributed Networks for Fast Video Semantic Segmentation
About
We present TDNet, a temporally distributed network designed for fast and accurate video semantic segmentation. We observe that features extracted from a certain high-level layer of a deep CNN can be approximated by composing features extracted from several shallower sub-networks. Leveraging the inherent temporal continuity in videos, we distribute these sub-networks over sequential frames. Therefore, at each time step, we only need to perform a lightweight computation to extract a group of sub-features from a single sub-network. The full features used for segmentation are then recomposed by applying a novel attention propagation module that compensates for geometric deformation between frames. A grouped knowledge distillation loss is also introduced to further improve the representation power at both the full-feature and sub-feature levels. Experiments on Cityscapes, CamVid, and NYUD-v2 demonstrate that our method achieves state-of-the-art accuracy with significantly higher speed and lower latency.
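The temporal distribution idea above can be illustrated with a minimal sketch: each incoming frame is routed to one of `m` lightweight sub-networks in round-robin order, and the full feature map is recomposed from the `m` most recent sub-feature groups. All names here (`TDNetSketch`, `step`) are hypothetical; the sub-networks are stand-in linear projections and the recomposition is plain concatenation, whereas TDNet proper uses deep CNN paths and its attention propagation module to align features across frames.

```python
import numpy as np
from collections import deque

class TDNetSketch:
    """Toy illustration of temporally distributed feature extraction.

    Each frame is processed by one of `m` cheap sub-networks (round-robin);
    the feature map used for segmentation is recomposed from the `m` most
    recent sub-feature groups held in a sliding window.
    """

    def __init__(self, m=4, in_ch=3, sub_ch=16, seed=0):
        rng = np.random.default_rng(seed)
        # One small projection per sub-network (stand-ins for CNN paths).
        self.subnets = [rng.standard_normal((in_ch, sub_ch)) for _ in range(m)]
        self.buffer = deque(maxlen=m)  # sliding window of sub-feature groups
        self.t = 0

    def step(self, frame):
        """frame: (H, W, in_ch) array -> full features of shape (H, W, m*sub_ch)."""
        proj = self.subnets[self.t % len(self.subnets)]
        sub = frame @ proj                 # lightweight per-frame computation
        self.buffer.append(sub)
        self.t += 1
        # Warm-up: repeat the newest group until the window is full.
        groups = list(self.buffer)
        while len(groups) < self.buffer.maxlen:
            groups.append(groups[-1])
        # TDNet recomposes with attention propagation; we just concatenate.
        return np.concatenate(groups, axis=-1)

model = TDNetSketch(m=4, in_ch=3, sub_ch=16)
for _ in range(5):
    feats = model.step(np.zeros((8, 8, 3)))
print(feats.shape)  # (8, 8, 64)
```

The point of the sketch is the cost profile: per time step only one sub-network runs, yet the recomposed feature map has the full `m * sub_ch` channels available for the segmentation head.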
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Semantic segmentation | Cityscapes (test) | 74.9 mIoU | 1145 |
| Semantic segmentation | Cityscapes (val) | 79.9 mIoU | 572 |
| Semantic segmentation | CamVid (test) | 76 mIoU | 411 |
| Semantic segmentation | Cityscapes (val) | 75 mIoU | 332 |
| Semantic segmentation | NYU Depth V2 (test) | 43.5 mIoU | 172 |
| Video semantic segmentation | Cityscapes (val) | 79.9 mIoU | 91 |
| Semantic segmentation | UAVid (test) | 52 mIoU | 37 |
| Semantic segmentation | NYU Depth V2 | -- | 26 |
| Semantic video segmentation | Cityscapes (test) | 79.4 mIoU | 24 |
| Surgical instrument segmentation | EndoVis 2017 (test) | 49.24 mIoU | 22 |