
Visual Implicit Geometry Transformer for Autonomous Driving

About

We introduce the Visual Implicit Geometry Transformer (ViGT), a geometric model for autonomous driving that estimates continuous 3D occupancy fields from surround-view camera rigs. ViGT represents a step towards foundational geometric models for autonomous driving, prioritizing scalability, architectural simplicity, and generalization across diverse sensor configurations. Our approach achieves this through a calibration-free architecture, enabling a single model to adapt to different sensor setups. Unlike general-purpose geometric foundation models that focus on pixel-aligned predictions, ViGT estimates a continuous 3D occupancy field in bird's-eye view (BEV), addressing domain-specific requirements. ViGT infers geometry from multiple camera views in a single metric coordinate frame, providing a common representation for multiple geometric tasks. Unlike most existing occupancy models, we adopt a self-supervised training procedure that leverages synchronized image-LiDAR pairs, eliminating the need for costly manual annotations. We validate the scalability and generalizability of our approach by training our model on a mixture of five large-scale autonomous driving datasets (NuScenes, Waymo, NuPlan, ONCE, and Argoverse), achieving state-of-the-art performance on the pointmap estimation task with the best average rank across all evaluated baselines. We further evaluate ViGT on the Occ3D-nuScenes benchmark, where it achieves performance comparable to supervised methods. The source code is publicly available at https://github.com/whesense/ViGT.
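
To make the idea of supervising an occupancy field from LiDAR without manual labels more concrete, here is a minimal sketch of one common pattern: each LiDAR return is treated as an occupied point, and points sampled between the sensor and the return are treated as free space. This is an illustration only and is not taken from the ViGT code; the function name `sample_ray_labels`, the `n_free` parameter, and the 0.95 cutoff are all assumptions.

```python
import numpy as np

def sample_ray_labels(origin, hits, n_free=4, rng=None):
    """Build (points, labels) from LiDAR returns.

    Hypothetical helper: returns are labeled 1 (occupied); points sampled
    between the sensor origin and each return are labeled 0 (free).
    """
    rng = np.random.default_rng() if rng is None else rng
    origin = np.asarray(origin, dtype=np.float64)   # (3,)
    hits = np.asarray(hits, dtype=np.float64)       # (N, 3)
    # Fractions along each ray in [0, 0.95), so free samples stop short of the surface.
    t = rng.uniform(0.0, 0.95, size=(hits.shape[0], n_free, 1))
    free = origin + t * (hits[:, None, :] - origin)  # (N, n_free, 3)
    points = np.concatenate([hits, free.reshape(-1, 3)], axis=0)
    labels = np.concatenate([np.ones(len(hits)), np.zeros(len(hits) * n_free)])
    return points, labels
```

A model that predicts occupancy at arbitrary metric 3D points could then be trained against such labels with a binary classification loss; whether ViGT samples its targets this way is not stated in the abstract.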

Arsenii Shirokov, Mikhail Kuznetsov, Danila Stepochkin, Egor Evdokimov, Daniil Glazkov, Nikolay Patakin, Anton Konushin, Dmitry Senushkin • 2026

Related benchmarks

Task | Dataset | Result | Rank
3D Occupancy Prediction | Occ3D-nuScenes (val) | -- | 144
Pointmap Estimation | nuScenes (test) | AbsRel 0.068 | 15
Pointmap Estimation | Argoverse 2 (AV2) (test) | AbsRel 0.131 | 15
Pointmap Estimation | ONCE (test) | AbsRel 0.169 | 15
Pointmap Estimation | NuPlan subsampled (test) | AbsRel 0.118 | 15
Pointmap Estimation | Waymo (test) | AbsRel 0.121 | 15
Pointmap Estimation | Aggregate (NuScenes, AV2, Waymo, ONCE, NuPlan) | Average Rank 1.8 | 9
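
For readers unfamiliar with the two metrics reported here, the sketch below shows how an absolute relative error (AbsRel) and an average rank across datasets are typically computed. It assumes AbsRel is the mean of |prediction - ground truth| / ground truth over points with valid ground truth, and that lower error gives a better (smaller) rank; the exact evaluation protocol behind the numbers on this page is not stated, and the function names are hypothetical.

```python
import numpy as np

def abs_rel(pred, gt, eps=1e-6):
    """Mean absolute relative error over entries with valid (positive) ground truth."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > eps  # skip missing or zero ground-truth values
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

def average_rank(scores):
    """scores maps method name -> list of per-dataset errors (lower is better).

    Returns each method's rank averaged over datasets (1 = best).
    Ties are broken by input order in this simple sketch.
    """
    names = list(scores)
    table = np.array([scores[n] for n in names], dtype=np.float64)  # (methods, datasets)
    ranks = table.argsort(axis=0).argsort(axis=0) + 1               # per-dataset ranks
    return {n: float(r) for n, r in zip(names, ranks.mean(axis=1))}
```

For example, a method that ranks first on one dataset and second on another has an average rank of 1.5, which is how a non-integer value such as the one in the aggregate row can arise.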
