Test-Time 3D Occupancy Prediction
About
Self-supervised 3D occupancy prediction offers a promising solution for understanding complex driving scenes without requiring costly 3D annotations. However, training dense occupancy decoders to capture fine-grained geometry and semantics can demand hundreds of GPU hours, and once trained, such models struggle to adapt to varying voxel resolutions or novel object categories without extensive retraining. To overcome these limitations, we propose a practical and flexible test-time occupancy prediction framework termed TT-Occ. Our method incrementally constructs, optimizes, and voxelizes time-aware 3D Gaussians from raw sensor streams by integrating vision foundation models (VFMs) at runtime. The flexible representation of 3D Gaussians enables voxelization at arbitrary user-specified resolutions, while the strong generalization capability of VFMs supports accurate perception and open-vocabulary recognition without requiring any network training or fine-tuning. To validate the generality and effectiveness of our framework, we present two variants, one LiDAR-based and one vision-centric, and conduct extensive experiments on the Occ3D-nuScenes and nuCraft benchmarks under varying voxel resolutions. Experimental results show that TT-Occ significantly outperforms existing computationally expensive pretrained self-supervised counterparts. Code is available at https://github.com/Xian-Bei/TT-Occ.
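The voxelization step described above, converting a set of semantic 3D Gaussians into an occupancy grid at a user-specified resolution, can be sketched as follows. This is an illustrative simplification, not the TT-Occ implementation: each Gaussian is reduced to a point at its mean, and each voxel takes the label of the highest-opacity Gaussian that falls inside it (the function name and arguments are hypothetical).

```python
import numpy as np

def voxelize_gaussians(means, labels, opacities, voxel_size, pc_range):
    """Voxelize semantic 3D Gaussians at an arbitrary resolution.

    Hedged sketch: real splatting would rasterize each Gaussian's full
    extent; here each Gaussian contributes only at its mean position.
    means:     (N, 3) Gaussian centers in world coordinates.
    labels:    (N,)   integer semantic labels.
    opacities: (N,)   per-Gaussian opacities used to break ties.
    pc_range:  [x_min, y_min, z_min, x_max, y_max, z_max].
    """
    lo = np.asarray(pc_range[:3], dtype=np.float64)
    hi = np.asarray(pc_range[3:], dtype=np.float64)
    dims = tuple(np.ceil((hi - lo) / voxel_size).astype(int))

    # Keep only Gaussians inside the volume of interest.
    inside = np.all((means >= lo) & (means < hi), axis=1)
    means, labels, opacities = means[inside], labels[inside], opacities[inside]

    # Map each Gaussian mean to a flat voxel index.
    idx = np.floor((means - lo) / voxel_size).astype(int)
    flat = np.ravel_multi_index(idx.T, dims)

    # Sort by (voxel, opacity): the last entry per voxel has max opacity.
    order = np.lexsort((opacities, flat))
    flat, labels = flat[order], labels[order]
    last = np.r_[flat[1:] != flat[:-1], True]  # last occurrence per voxel

    grid = np.full(dims, -1, dtype=int)  # -1 marks empty voxels
    grid.flat[flat[last]] = labels[last]
    return grid
```

Because the grid shape is derived from `voxel_size` at call time, the same Gaussian set can be re-voxelized at any resolution without retraining, which is the flexibility the abstract refers to.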
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| 3D Semantic Occupancy Prediction | Occ3D | RayIoU 13.4 | 40 |
| 3D Semantic Occupancy Prediction | Occ3D-nuScenes v1.0 (val) | mIoU 27.41 | 26 |
| Semantic Occupancy Estimation | Occ3D-nuScenes | mIoU 16.7 | 9 |
| 3D Semantic Occupancy Prediction | nuCraft high-resolution | Overall mIoU 10.92 | 4 |
| Occupancy Prediction | nuScenes Rainy and Nighttime scenes v1.0 (test) | Score 0911 (Rainy)27 | 3 |