
Depth AnyEvent: A Cross-Modal Distillation Paradigm for Event-Based Monocular Depth Estimation

About

Event cameras capture sparse, high-temporal-resolution visual information, making them particularly suitable for challenging environments with high-speed motion and strongly varying lighting conditions. However, the lack of large datasets with dense ground-truth depth annotations hinders learning-based monocular depth estimation from event data. To address this limitation, we propose a cross-modal distillation paradigm that generates dense proxy labels by leveraging a Vision Foundation Model (VFM). Our strategy requires an event stream spatially aligned with RGB frames, a simple setup that is even available off the shelf, and exploits the robustness of large-scale VFMs. Additionally, we propose adapting VFMs to infer depth from monocular event cameras, either by using a vanilla model such as Depth Anything v2 (DAv2) or by deriving a novel recurrent architecture from it. We evaluate our approach on synthetic and real-world datasets, demonstrating that i) our cross-modal paradigm achieves competitive performance compared to fully supervised methods without requiring expensive depth annotations, and ii) our VFM-based models achieve state-of-the-art performance.
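The distillation idea described above can be illustrated with a minimal sketch: a teacher model predicts depth from an RGB frame, and that prediction serves as a dense proxy label for a student that sees only the spatially aligned events. Everything named here is hypothetical (`make_proxy_label`, `align_scale_shift`, `distillation_loss`, and the `teacher` callable), and the scale-and-shift-aligned L1 loss is one common choice for supervising relative depth, not a loss confirmed by the source.

```python
import numpy as np

def make_proxy_label(rgb_frame, teacher):
    """Run a (hypothetical) frame-based teacher to get a dense proxy depth map."""
    return teacher(rgb_frame)

def align_scale_shift(pred, target, mask=None):
    """Least-squares scale s and shift t such that s * pred + t approximates target."""
    if mask is None:
        mask = np.ones(pred.shape, dtype=bool)
    p = pred[mask].astype(np.float64)
    y = target[mask].astype(np.float64)
    design = np.stack([p, np.ones_like(p)], axis=1)  # columns: prediction, bias
    (s, t), *_ = np.linalg.lstsq(design, y, rcond=None)
    return s, t

def distillation_loss(student_pred, proxy_label, mask=None):
    """Mean absolute error after aligning the student output to the proxy label.

    Assumption: the proxy label is relative depth, so the comparison is made
    up to an affine transform. The source does not state which loss is used.
    """
    if mask is None:
        mask = np.ones(student_pred.shape, dtype=bool)
    s, t = align_scale_shift(student_pred, proxy_label, mask)
    residual = s * student_pred[mask] + t - proxy_label[mask]
    return float(np.mean(np.abs(residual)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rgb = rng.random((4, 5, 3))
    # Stand-in teacher: a real setup would call a VFM such as DAv2 here.
    toy_teacher = lambda frame: frame.mean(axis=-1)
    label = make_proxy_label(rgb, toy_teacher)
    # Stand-in student output computed from the events aligned with `rgb`.
    student_output = 2.0 * label + 3.0
    print(distillation_loss(student_output, label))
```

In this toy run the student output is an exact affine transform of the proxy label, so the aligned loss is numerically close to zero. No ground-truth depth appears anywhere in the loop, which is the point of the paradigm.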

Luca Bartolomei, Enrico Mannocci, Fabio Tosi, Matteo Poggi, Stefano Mattoccia • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Monocular Depth Estimation | MVSEC Depth | RMSE 6.465 | 20 |
| Monocular Depth Estimation | DSEC-Depth | RMSE 8.88 | 20 |
| Metric Depth Estimation | MVSEC (night1) | MAE (10m) 1.87 | 9 |
| Metric Depth Estimation | MVSEC (day1) | MAE (10m) 1.5 | 9 |
| Metric Depth Estimation | MVSEC (night2) | MAE (10m) 1.99 | 9 |
| Metric Depth Estimation | MVSEC (night3) | Absolute MAE (10m) 2.05 | 9 |
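The RMSE and MAE figures above are standard depth error metrics. As a rough sketch of how they are typically computed, the function below evaluates both over valid pixels. The name `depth_errors` is hypothetical, and reading "MAE (10m)" as the mean absolute error over pixels whose ground-truth depth is at most 10 m is an assumption; the page does not define it.

```python
import numpy as np

def depth_errors(pred, gt, cutoff=None):
    """Return MAE and RMSE between predicted and ground-truth depth.

    Pixels without ground truth (non-finite or non-positive values) are
    skipped. If `cutoff` is given, only pixels with gt <= cutoff are
    scored (assumed meaning of a label such as "MAE (10m)").
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = np.isfinite(gt) & (gt > 0)
    if cutoff is not None:
        valid &= gt <= cutoff
    if not valid.any():
        raise ValueError("no valid ground-truth pixels to evaluate")
    diff = pred[valid] - gt[valid]
    return {
        "mae": float(np.mean(np.abs(diff))),
        "rmse": float(np.sqrt(np.mean(diff ** 2))),
    }

if __name__ == "__main__":
    gt = np.array([2.0, 8.0, 20.0, 0.0])    # 0.0 marks a pixel with no ground truth
    pred = np.array([3.0, 6.0, 30.0, 5.0])
    print(depth_errors(pred, gt, cutoff=10.0))  # scores only the 2 m and 8 m pixels
    print(depth_errors(pred, gt))               # scores all three valid pixels
```

With the cutoff, the large error on the 20 m pixel is excluded, which is why cutoff-based MAE values are usually much smaller than full-range RMSE values on the same sequence.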
