
Video Autoencoder: self-supervised disentanglement of static 3D structure and motion

About

A video autoencoder is proposed for learning disentangled representations of 3D structure and camera pose from videos in a self-supervised manner. Relying on temporal continuity in videos, our work assumes that the 3D scene structure in nearby video frames remains static. Given a sequence of video frames as input, the video autoencoder extracts a disentangled representation of the scene including: (i) a temporally-consistent deep voxel feature to represent the 3D structure and (ii) a 3D trajectory of camera pose for each frame. These two representations are then re-entangled for rendering the input video frames. This video autoencoder can be trained directly using a pixel reconstruction loss, without any ground truth 3D or camera pose annotations. The disentangled representation can be applied to a range of tasks, including novel view synthesis, camera pose estimation, and video generation by motion following. We evaluate our method on several large-scale natural video datasets, and show generalization results on out-of-domain images.

Zihang Lai, Sifei Liu, Alexei A. Efros, Xiaolong Wang • 2021
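The encode-disentangle-re-entangle loop described in the abstract can be sketched in miniature. This is a toy illustration, not the authors' architecture: the real model uses learned deep networks, 3D voxel features, and differentiable rendering, whereas here the "structure" is a hand-built voxel grid, the "pose" is a simple 2D image shift, and all function names and shapes are hypothetical.

```python
# Toy sketch of the video-autoencoder pipeline (illustrative only;
# the paper's encoders, pose representation, and renderer are learned).
import numpy as np

def encode_structure(frames):
    # (i) A temporally-consistent "3D structure" shared by all frames.
    # Toy stand-in: lift the first frame into a (D, H, W) voxel grid.
    f0 = frames[0]                       # (H, W)
    voxels = np.stack([f0] * 4, axis=0)  # (D, H, W)
    return voxels / voxels.shape[0]      # normalize so projection = f0

def encode_pose(frames):
    # (ii) One "camera pose" per frame.
    # Toy stand-in: an integer (dy, dx) shift growing over time.
    return [(0, t) for t in range(len(frames))]

def render(voxels, pose):
    # Re-entangle: transform the shared voxels by the per-frame pose,
    # then project (sum over depth) back to an image.
    dy, dx = pose
    moved = np.roll(voxels, shift=(dy, dx), axis=(1, 2))
    return moved.sum(axis=0)

def reconstruction_loss(frames):
    # Self-supervised objective: pixel MSE between rendered and
    # input frames, with no 3D or pose ground truth.
    voxels = encode_structure(frames)
    poses = encode_pose(frames)
    recons = [render(voxels, p) for p in poses]
    return float(np.mean([(r - f) ** 2 for r, f in zip(recons, frames)]))
```

On a toy video whose frames really are shifted copies of one static scene, the loss is exactly zero, which is the static-scene assumption the paper relies on: nearby frames share one structure and differ only by camera motion.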

Related benchmarks

| Task            | Dataset                          | Metric | Result | Rank |
|-----------------|----------------------------------|--------|--------|------|
| View Synthesis  | CO3D-Hydrants (test)             | LPIPS  | 0.5113 | 12   |
| View Synthesis  | KITTI (test)                     | PSNR   | 15.17  | 11   |
| View Synthesis  | CO3D 10 (test)                   | LPIPS  | 0.5376 | 3    |
| View Synthesis  | RealEstate10K (test)             | LPIPS  | 0.4835 | 3    |
| Pose Estimation | CO3D Hydrant short-sequence      | ATE    | 1.8    | 2    |
| Pose Estimation | CO3D 10-Category short-sequence  | ATE    | 0.66   | 2    |
| Pose Estimation | RealEstate10K short-sequence     | ATE    | 0.096  | 2    |
| Pose Estimation | KITTI short-sequence             | ATE    | 0.12   | 2    |
