DFANet: Deep Feature Aggregation for Real-Time Semantic Segmentation
About
This paper introduces an extremely efficient CNN architecture named DFANet for semantic segmentation under resource constraints. Our proposed network starts from a single lightweight backbone and aggregates discriminative features through sub-network and sub-stage cascades, respectively. Based on multi-scale feature propagation, DFANet substantially reduces the number of parameters while still obtaining a sufficient receptive field and enhancing the model's learning ability, which strikes a balance between speed and segmentation performance. Experiments on the Cityscapes and CamVid datasets demonstrate the superior performance of DFANet, with 8$\times$ fewer FLOPs and 2$\times$ higher speed than existing state-of-the-art real-time semantic segmentation methods at comparable accuracy. Specifically, it achieves 70.3\% mean IoU on the Cityscapes test dataset with only 1.7 GFLOPs and a speed of 160 FPS on one NVIDIA Titan X card, and 71.3\% mean IoU with 3.4 GFLOPs when inferring on a higher-resolution image.
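The accuracy figures above are reported as mean IoU (intersection over union averaged across classes). As a rough illustration of that metric only, and not the authors' evaluation code, the following minimal sketch computes it from flattened lists of per-pixel class ids. The function name `mean_iou`, the flat-list input format, and the `ignore_index=255` convention for unlabeled pixels are assumptions made for this example.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over the classes that occur in pred or target.

    pred, target: equal-length sequences of integer class ids
    (for example, a flattened label map). Pixels whose target equals
    ignore_index are skipped (an assumed convention for void labels).
    """
    intersection = [0] * num_classes  # true positives per class
    union = [0] * num_classes         # TP + FP + FN per class
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            union[p] += 1  # false positive for the predicted class
            union[t] += 1  # false negative for the true class
    # Average only over classes that actually appear.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


if __name__ == "__main__":
    pred = [0, 0, 1, 1, 2, 2]
    target = [0, 1, 1, 1, 2, 255]
    print(mean_iou(pred, target, num_classes=3))
```

Benchmark scores are usually obtained by accumulating the per-class intersection and union counts over the whole dataset before dividing, rather than by averaging per-image values, so this sketch would need that accumulation step to approximate a reported number.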
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Semantic segmentation | Cityscapes (test) | mIoU 71.3 | 1145 |
| Semantic segmentation | CamVid (test) | mIoU 64.7 | 411 |
| Semantic segmentation | Cityscapes (val) | mIoU 70.3 | 332 |
| Semantic segmentation | Cityscapes (val) | mIoU 71.3 | 287 |
| Semantic segmentation | Cityscapes (val) | -- | 108 |
| Semantic segmentation | Trans10K v2 (test) | mIoU 42.54 | 104 |
| Semantic segmentation | Cityscapes fine (test) | mIoU 70.3 | 44 |
| Semantic segmentation | Trans10K v2 | Accuracy 85.15 | 27 |
| Scene Parsing | Cityscapes (test) | mIoU 71.3 | 17 |