Monocular Semantic Occupancy Grid Mapping with Convolutional Variational Encoder-Decoder Networks
About
In this work, we research and evaluate end-to-end learning of monocular semantic-metric occupancy grid mapping from weak binocular ground truth. The network learns to predict four classes, as well as a camera-to-bird's-eye-view mapping. At its core, it utilizes a variational encoder-decoder network that encodes the front-view visual information of the driving scene and subsequently decodes it into a 2-D top-view Cartesian coordinate system. Evaluations on Cityscapes show that end-to-end learning of semantic-metric occupancy grids outperforms the deterministic mapping approach with a flat-plane assumption by more than 12% mean IoU. Furthermore, we show that variational sampling with a relatively small embedding vector brings robustness against vehicle dynamic perturbations and generalizability to unseen KITTI data. Our network achieves real-time inference rates of approximately 35 Hz for an input image with a resolution of 256x512 pixels and an output map with 64x64 occupancy grid cells on a Titan V GPU.
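The variational bottleneck mentioned above samples the latent embedding rather than using the encoder's output directly. A minimal sketch of that sampling step (the reparameterization trick) is shown below; the function name and the latent size of 128 are illustrative assumptions, not taken from the paper, which only states that the embedding vector is relatively small.

```python
import numpy as np

def sample_embedding(mu, log_var, rng=None):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).

    The encoder predicts the mean and log-variance of a small latent
    embedding; sampling from this distribution (instead of passing mu
    straight to the decoder) is what gives the variational bottleneck
    its robustness to perturbations of the input view.
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.exp(0.5 * log_var)          # log-variance -> std. deviation
    eps = rng.standard_normal(mu.shape)    # noise, independent of the input
    return mu + sigma * eps

# Hypothetical 128-dim latent for illustration.
mu = np.zeros(128)
log_var = np.zeros(128)                    # sigma = 1 everywhere
z = sample_embedding(mu, log_var, rng=np.random.default_rng(0))
```

As the log-variance goes to negative infinity, sigma vanishes and the sample collapses to the mean, recovering a deterministic encoder as a special case.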
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Semantic segmentation | nuScenes (val) | -- | 212 |
| LiDAR Semantic Segmentation | nuScenes official (test) | mIoU 8.8 | 132 |
| BEV Semantic Segmentation | nuScenes (val) | Drivable Area IoU 54.7 | 28 |
| BeV Segmentation | nuScenes v1.0 (val) | Drivable Area 60.82 | 25 |
| BeV Segmentation | nuScenes (val) | Vehicle Segmentation Score 23.3 | 16 |
| Map-view Semantic Segmentation | Argoverse (val) | Vehicle IoU 14 | 9 |
| Top-view semantic segmentation | Argoverse Road | mIoU 72.84 | 8 |
| Top-view semantic segmentation | Argoverse Vehicle | mIoU 24.16 | 8 |
| Vehicle Segmentation | nuScenes Setting 1: 100m x 50m at 25cm resolution, v1.0-trainval (val) | mIoU 8.8 | 7 |
| Object Detection | nuScenes v1.0 (val) | -- | 7 |