Monocular Semantic Occupancy Grid Mapping with Convolutional Variational Encoder-Decoder Networks
About
In this work, we research and evaluate end-to-end learning of monocular semantic-metric occupancy grid mapping from weak binocular ground truth. The network learns to predict four classes, as well as a camera-to-bird's-eye-view mapping. At its core, it utilizes a variational encoder-decoder network that encodes the front-view visual information of the driving scene and subsequently decodes it into a 2-D top-view Cartesian coordinate system. Evaluations on Cityscapes show that end-to-end learning of semantic-metric occupancy grids outperforms the deterministic mapping approach with a flat-plane assumption by more than 12% mean IoU. Furthermore, we show that variational sampling with a relatively small embedding vector brings robustness against vehicle dynamic perturbations and generalizability to unseen KITTI data. Our network achieves real-time inference rates of approximately 35 Hz for an input image with a resolution of 256x512 pixels and an output map with 64x64 occupancy grid cells on a Titan V GPU.
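The variational bottleneck mentioned above samples the latent embedding rather than using the encoder's output directly. A minimal sketch of that sampling step (the reparameterization trick) is shown below; the function name and the latent size of 128 are illustrative assumptions, not taken from the paper, which only states that the embedding vector is relatively small.

```python
import numpy as np

def sample_embedding(mu, log_var, rng=None):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).

    The encoder predicts the mean and log-variance of a small latent
    embedding; sampling from this distribution (instead of passing mu
    straight to the decoder) is what gives the variational bottleneck
    its robustness to perturbations of the input view.
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.exp(0.5 * log_var)          # log-variance -> std. deviation
    eps = rng.standard_normal(mu.shape)    # noise, independent of the input
    return mu + sigma * eps

# Hypothetical 128-dim latent for illustration.
mu = np.zeros(128)
log_var = np.zeros(128)                    # sigma = 1 everywhere
z = sample_embedding(mu, log_var, rng=np.random.default_rng(0))
```

As the log-variance goes to negative infinity, sigma vanishes and the sample collapses to the mean, recovering a deterministic encoder as a special case.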
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Semantic segmentation | nuScenes (val) | -- | 212 |
| LiDAR Semantic Segmentation | nuScenes official (test) | mIoU 8.8 | 132 |
| BEV Semantic Segmentation | nuScenes (val) | Drivable Area IoU 54.7 | 28 |
| BeV Segmentation | nuScenes v1.0 (val) | Drivable Area 60.82 | 25 |
| BeV Segmentation | nuScenes (val) | Vehicle Segmentation Score 23.3 | 16 |
| Map-view Semantic Segmentation | Argoverse (val) | Vehicle IoU 14 | 9 |
| Top-view semantic segmentation | Argoverse Road | mIoU 72.84 | 8 |
| Top-view semantic segmentation | Argoverse Vehicle | mIoU 24.16 | 8 |
| Vehicle Segmentation | nuScenes Setting 1: 100m x 50m at 25cm resolution, v1.0-trainval (val) | mIoU 8.8 | 7 |
| Object Detection | nuScenes v1.0 (val) | -- | 7 |