SS3D: End2End Self-Supervised 3D from Web Videos

About

We present SS3D, a web-scale SfM-based self-supervision pretraining pipeline for feed-forward 3D estimation from monocular video. Our model jointly predicts depth, ego-motion, and intrinsics in a single forward pass and is trained and evaluated as a coherent end-to-end 3D estimator. To stabilize joint learning, we use an intrinsics-first two-stage schedule and a unified single-checkpoint evaluation protocol. Scaling SfM self-supervision to unconstrained web video is challenging due to weak multi-view observability and strong corpus heterogeneity; we address these with a multi-view signal proxy (MVS) used for filtering and curriculum sampling, and with expert training distilled into a single student. Pretraining on YouTube-8M (~100M frames after filtering) yields strong cross-domain zero-shot transfer and improved fine-tuning performance over prior self-supervised baselines. We release the pretrained checkpoint and code.
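
For readers unfamiliar with SfM-based self-supervision, the sketch below shows the generic geometric step that ties depth, ego-motion, and intrinsics together: each target pixel is back-projected with its depth, moved by the relative pose, and re-projected into a neighbouring frame. This is a minimal illustration of the standard view-synthesis geometry, not the SS3D implementation; the function name `reproject` and its argument conventions are assumptions made for this example.

```python
import numpy as np

def reproject(depth, K, T):
    """Map every target pixel to its location in a source view.

    Hypothetical helper for illustration only.
    depth: (H, W) depth of the target frame
    K:     (3, 3) pinhole intrinsics (assumed shared by both frames)
    T:     (4, 4) rigid transform from target to source camera
    Returns an (H, W, 2) array of (u, v) coordinates in the source image.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Homogeneous pixel coordinates, shape (H, W, 3).
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    # Back-project: X = depth * K^{-1} [u, v, 1]^T.
    pts = (pix @ np.linalg.inv(K).T) * depth[..., None]
    # Apply the relative pose (rotation, then translation).
    pts_src = pts @ T[:3, :3].T + T[:3, 3]
    # Project into the source image and dehomogenize.
    proj = pts_src @ K.T
    return proj[..., :2] / proj[..., 2:3]

if __name__ == "__main__":
    K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 1.5], [0.0, 0.0, 1.0]])
    T = np.eye(4)
    T[0, 3] = 0.1  # small sideways camera motion
    coords = reproject(np.full((3, 4), 2.0), K, T)
    print(coords[0, 0])  # source-view location of target pixel (0, 0)
```

In a self-supervised setup, coordinates like these are typically used to warp the source frame onto the target and compare the two photometrically, so errors in any of the three predicted quantities show up in the same loss.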

Marwane Hariat, Gianni Franchi, David Filliat, Antoine Manzanera • 2026

Related benchmarks

Task	Dataset	Metric	Result	
Monocular Depth Estimation	NYU v2 (test)	Abs Rel	0.09	327
Camera pose estimation	Sintel	ATE	0.09	203
Depth Estimation	KITTI	RMSE	4.016	184
Monocular Depth Estimation	KITTI Eigen (test)	AbsRel	0.064	56
Monocular Depth Estimation	KITTI	AbsRel	6.4	33
Camera pose estimation	Sintel	ATE	0.09	16
Camera pose estimation	TUM-RGBD dynamics	ATE	0.092	11
Camera intrinsics estimation	Sintel (all sequences)	AFE (px)	256.6	7
Intrinsic parameter estimation	Sintel	AFE (px)	256.6	5
3D Semantic Segmentation	KITTI Odometry Open-Vocabulary 3D Segmentation Protocol	mIoU	17.5	5
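
The two depth metrics in the table have standard definitions: AbsRel is the mean of |d - d*| / d* and RMSE is the square root of the mean of (d - d*)^2, where d is the predicted depth and d* the ground truth. The sketch below computes both. The function names and the choice to treat ground-truth values of zero as invalid are assumptions for illustration; the exact evaluation protocol behind the numbers above (depth caps, crops, scale alignment) is not stated on this page.

```python
import numpy as np

def abs_rel(pred, gt):
    """Mean absolute relative error, returned as a fraction (not a percentage)."""
    mask = gt > 0  # assumption: zero marks pixels without ground truth
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))

def rmse(pred, gt):
    """Root mean squared error in the units of the depth maps."""
    mask = gt > 0  # same validity assumption as above
    return float(np.sqrt(np.mean((pred[mask] - gt[mask]) ** 2)))

if __name__ == "__main__":
    gt = np.array([1.0, 2.0, 4.0, 0.0])    # last pixel has no ground truth
    pred = np.array([1.1, 1.8, 4.4, 3.0])
    print(abs_rel(pred, gt), rmse(pred, gt))
```

Some leaderboards report AbsRel as a percentage rather than a fraction; this sketch returns a fraction.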

