DepthFocus: Controllable Depth Estimation for See-Through Scenes
About
Depth in the real world is rarely singular. Transmissive materials create layered ambiguities that confound conventional perception systems. Existing models remain passive; conventional approaches typically estimate static depth maps anchored to the nearest surface, and even recent multi-head extensions suffer from a representational bottleneck due to fixed feature representations. This stands in contrast to human vision, which actively shifts focus to perceive a desired depth. We introduce DepthFocus, a steerable Vision Transformer that redefines stereo depth estimation as condition-aware control. Instead of extracting fixed features, our model dynamically modulates its computation based on a physical reference depth, integrating dual conditional mechanisms to selectively perceive geometry aligned with the desired focus. Leveraging a newly curated large-scale synthetic dataset, DepthFocus achieves state-of-the-art results across all evaluated benchmarks, including both standard single-layer and complex multi-layered scenarios. While maintaining high precision in opaque regions, our approach effectively resolves depth ambiguities in transparent and reflective scenes by selectively reconstructing geometry at a target distance. This capability enables robust, intent-driven perception that significantly outperforms existing multi-layer methods, marking a substantial step toward active 3D perception.
Project page: https://junhong-3dv.github.io/depthfocus-project/
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Stereo Depth Estimation | Booster All type | EPE | 1.56 | 14 |
| Multi-layer depth estimation | Multi-layered synthetic benchmark Opaque Layer 1 | Bad-2 Error | 2.74 | 11 |
| Multi-layer depth estimation | Multi-layered synthetic benchmark Transmissive Layer 1 | Bad-2 | 5.47 | 10 |
| Stereo Matching | Laboratory bilayer benchmark No plate | Bad-4 (Opaque) | 1.35 | 9 |
| Stereo Matching | Laboratory bilayer benchmark With plate 60% transmittance | Bad-4 Error (Opaque) | 1.27 | 9 |
| Stereo Matching | Laboratory bilayer benchmark With plate 80% transmittance | Bad-4 Error (Opaque) | 1.15 | 9 |
| Multi-layer depth estimation | LayeredFlow (val) | Layer 1 EPE | 3.13 | 8 |
| Stereo Depth Estimation | Booster (Opaque) | EPE | 1.07 | 7 |
| Stereo Depth Estimation | Middlebury (Non Occlusion) | EPE (Endpoint Error) | 0.67 | 7 |
| Multi-layer depth estimation | Multi-layered synthetic benchmark Transmissive Layer 4 | Bad-2 | 33.01 | 5 |
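
The table reports EPE and Bad-N scores without defining them. The sketch below shows how such metrics are commonly computed in stereo evaluation, assuming the usual conventions: EPE as the mean absolute disparity error over valid pixels, and Bad-N as the percentage of valid pixels whose error exceeds N pixels. The function names `endpoint_error` and `bad_threshold` are hypothetical, inputs are flattened per-pixel lists, and the exact protocol of each benchmark listed here (validity masks, resolution, per-layer handling) may differ.

```python
from typing import Sequence


def _valid_errors(pred: Sequence[float], gt: Sequence[float],
                  valid: Sequence[bool]) -> list:
    """Absolute per-pixel errors, restricted to pixels marked valid."""
    errors = [abs(p - g) for p, g, v in zip(pred, gt, valid) if v]
    if not errors:
        raise ValueError("no valid pixels to evaluate")
    return errors


def endpoint_error(pred: Sequence[float], gt: Sequence[float],
                   valid: Sequence[bool]) -> float:
    """Assumed EPE definition: mean absolute error over valid pixels."""
    errors = _valid_errors(pred, gt, valid)
    return sum(errors) / len(errors)


def bad_threshold(pred: Sequence[float], gt: Sequence[float],
                  valid: Sequence[bool], tau: float) -> float:
    """Assumed Bad-N definition: percent of valid pixels with error > tau."""
    errors = _valid_errors(pred, gt, valid)
    return 100.0 * sum(e > tau for e in errors) / len(errors)


# Toy data: five pixels, the last one has no ground truth and is masked out.
pred = [10.0, 12.0, 20.0, 35.0, 7.0]
gt = [10.0, 15.0, 21.0, 30.0, 0.0]
valid = [True, True, True, True, False]

print(endpoint_error(pred, gt, valid))      # errors 0, 3, 1, 5 -> 2.25
print(bad_threshold(pred, gt, valid, 2.0))  # 2 of 4 pixels exceed 2 -> 50.0
```

Under these assumed definitions, lower is better for both metrics, and a multi-layer benchmark would apply the same computation separately to the ground truth of each layer.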