
SS3D: End2End Self-Supervised 3D from Web Videos

About

We present SS3D, a web-scale SfM-based self-supervision pretraining pipeline for feed-forward 3D estimation from monocular video. Our model jointly predicts depth, ego-motion, and intrinsics in a single forward pass and is trained/evaluated as a coherent end-to-end 3D estimator. To stabilize joint learning, we use an intrinsics-first two-stage schedule and a unified single-checkpoint evaluation protocol. Scaling SfM self-supervision to unconstrained web video is challenging due to weak multi-view observability and strong corpus heterogeneity; we address these with a multi-view signal proxy (MVS) used for filtering and curriculum sampling, and with expert training distilled into a single student. Pretraining on YouTube-8M (~100M frames after filtering) yields strong cross-domain zero-shot transfer and improved fine-tuning performance over prior self-supervised baselines. We release the pretrained checkpoint and code.
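As a rough illustration of why depth, ego-motion, and intrinsics can be supervised together from video alone, the sketch below shows the generic reprojection step that SfM-style self-supervision usually builds on: a pixel is back-projected with its depth, moved by the relative camera pose, and projected into the neighboring frame. This is the standard textbook formulation, not code from the SS3D release; the function name `reproject` and all numeric values are hypothetical.

```python
import numpy as np

def reproject(pixels, depth, K, R, t):
    """Map pixels (N, 2) with depths (N,) from one view into a neighboring view.

    K is a 3x3 pinhole intrinsics matrix; R (3x3) and t (3,) are the relative
    camera motion. Generic sketch only, not the authors' implementation.
    """
    ones = np.ones((pixels.shape[0], 1))
    homog = np.hstack([pixels, ones])        # homogeneous pixel coordinates
    rays = homog @ np.linalg.inv(K).T        # back-project to normalized rays
    points = rays * depth[:, None]           # 3D points in the source camera frame
    moved = points @ R.T + t                 # apply the relative ego-motion
    proj = moved @ K.T                       # project with the same intrinsics
    return proj[:, :2] / proj[:, 2:3]        # perspective divide

# Hypothetical values chosen only for illustration.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
pixels = np.array([[320.0, 240.0]])
depth = np.array([2.0])
target = reproject(pixels, depth, K, np.eye(3), np.array([0.1, 0.0, 0.0]))
```

In a self-supervised setup, the photometric difference between the source frame and the neighboring frame sampled at `target` typically serves as the training signal, so an error in any of the three predicted quantities shows up in the same loss.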

Marwane Hariat, Gianni Franchi, David Filliat, Antoine Manzanera • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Monocular Depth Estimation | NYU v2 (test) | Abs Rel | 0.09 | 320
Camera pose estimation | Sintel | ATE | 0.09 | 203
Depth Estimation | KITTI | RMSE | 4.016 | 156
Monocular Depth Estimation | KITTI Eigen (test) | AbsRel | 0.064 | 56
Camera pose estimation | TUM-RGBD dynamics | ATE | 0.092 | 8
Camera intrinsics estimation | Sintel (all sequences) | AFE (px) | 256.6 | 7
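For readers unfamiliar with the metrics listed, the sketch below shows how Abs Rel, RMSE, and ATE are commonly computed. It follows the usual definitions rather than any particular benchmark's evaluation script; the function names are hypothetical, pixels with non-positive ground-truth depth are assumed invalid, and the ATE helper assumes the two trajectories have already been aligned (benchmarks typically perform a rigid or similarity alignment first).

```python
import numpy as np

def abs_rel(pred, gt):
    """Mean absolute relative depth error over pixels with valid ground truth."""
    mask = gt > 0  # assumption: non-positive ground-truth depth marks invalid pixels
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))

def rmse(pred, gt):
    """Root-mean-square depth error over pixels with valid ground truth."""
    mask = gt > 0
    return float(np.sqrt(np.mean((pred[mask] - gt[mask]) ** 2)))

def ate_rmse(pred_xyz, gt_xyz):
    """Absolute trajectory error as the RMSE of per-frame position differences.

    Assumes pred_xyz and gt_xyz are (N, 3) arrays that are already aligned.
    """
    dists = np.linalg.norm(pred_xyz - gt_xyz, axis=1)
    return float(np.sqrt(np.mean(dists ** 2)))

# Toy values for illustration only; they are not results from the paper.
pred_depth = np.array([1.1, 2.0, 0.5])
gt_depth = np.array([1.0, 2.0, 0.0])   # last pixel has no ground truth
pred_traj = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
gt_traj = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 1.0]])
```

All three metrics are errors, so lower values are better.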
