
StableDPT: Temporal Stable Monocular Video Depth Estimation

About

Applying single-image Monocular Depth Estimation (MDE) models to video sequences introduces significant temporal instability and flickering artifacts. We propose a novel approach that adapts any state-of-the-art image-based depth estimation model for video processing by integrating a new temporal module that can be trained on a single GPU in a few days. Our architecture, StableDPT, builds upon an off-the-shelf Vision Transformer (ViT) encoder and enhances the Dense Prediction Transformer (DPT) head. The core of our contribution lies in the temporal layers within the head, which use an efficient cross-attention mechanism to integrate information from keyframes sampled across the entire video sequence. This allows the model to capture global context and inter-frame relationships, leading to more accurate and temporally stable depth predictions. Furthermore, we propose a novel inference strategy for processing videos of arbitrary length that avoids the scale misalignment and redundant computations associated with the overlapping windows used in other methods. Evaluations on multiple benchmark datasets demonstrate improved temporal consistency, performance competitive with the state of the art, and 2x faster processing in real-world scenarios.
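
The keyframe cross-attention idea described above can be illustrated with a minimal NumPy sketch. This is not the StableDPT implementation: the evenly spaced keyframe policy, the function names sample_keyframes and cross_attention, the token shapes, and the absence of learned query/key/value projections are all assumptions made for illustration.

```python
import numpy as np


def sample_keyframes(num_frames, num_keyframes):
    """Return frame indices spread evenly over the whole clip (assumed policy)."""
    return np.linspace(0, num_frames - 1, num_keyframes).round().astype(int).tolist()


def cross_attention(queries, context):
    """Single-head scaled dot-product cross-attention, no learned projections.

    queries: (N, D) tokens of the frame being processed.
    context: (M, D) tokens gathered from the sampled keyframes.
    Returns the fused tokens (N, D) and the attention weights (N, M).
    """
    scale = np.sqrt(queries.shape[-1])
    scores = queries @ context.T / scale                    # (N, M) similarities
    scores = scores - scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True) # softmax over context
    return weights @ context, weights


# Hypothetical toy clip: 192 frames, 6 tokens per frame, 8 channels per token.
rng = np.random.default_rng(0)
num_frames, tokens_per_frame, dim = 192, 6, 8
video_tokens = rng.normal(size=(num_frames, tokens_per_frame, dim))

keyframes = sample_keyframes(num_frames, num_keyframes=4)
context = video_tokens[keyframes].reshape(-1, dim)          # (4 * 6, 8)

# Frame 100 attends to the keyframe tokens only, not to every frame.
fused, weights = cross_attention(video_tokens[100], context)
print(keyframes, fused.shape, weights.shape)
```

In this toy setup each frame attends to 24 keyframe tokens rather than to all 192 x 6 tokens of the clip, which is one plausible reading of why attending to keyframes is cheaper than full temporal attention.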

Ivan Sobko, Hayko Riemenschneider, Markus Gross, Christopher Schroers • 2026

Related benchmarks

Task              Dataset                                             Metric              Result  Rank
Depth Estimation  KITTI                                               AbsRel              0.13    92
Depth Estimation  TUM-RGBD                                            Abs Rel Error       0.12    16
Depth Estimation  Sintel                                              AbsRel              0.35    12
Depth Estimation  Infinigen                                           AbsRel              0.29    9
Depth Estimation  192-frame sequence 518x924 resolution (inference)   Inference Time (s)  4.4     5
Depth Estimation  Average (Infinigen, Sintel, KITTI, TUM RGB-D)       AbsRel              0.22    5
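
For readers unfamiliar with the metric, AbsRel (absolute relative error) is the mean of |prediction - ground truth| / ground truth over pixels with valid ground truth; lower is better. The sketch below shows that standard definition on flat lists of depth values. It assumes valid pixels are those with positive ground truth and omits any scale or shift alignment; whether the listed results use such alignment is not stated here. The name abs_rel is illustrative.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error over pixels with positive ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


# Toy example: the third pixel has no valid ground truth and is skipped.
pred = [1.1, 1.8, 5.0]
gt = [1.0, 2.0, 0.0]
score = abs_rel(pred, gt)   # (0.1 / 1.0 + 0.2 / 2.0) / 2

# Averaging the four per-dataset AbsRel values listed in the table.
per_dataset = [0.13, 0.12, 0.35, 0.29]   # KITTI, TUM-RGBD, Sintel, Infinigen
mean_abs_rel = sum(per_dataset) / len(per_dataset)
print(round(score, 4), round(mean_abs_rel, 4))
```

The mean of the four rounded per-dataset values is 0.2225, which is consistent with the 0.22 reported in the Average row.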