Unsupervised Learning of Disentangled Representations from Video
About
We present a new model, DrNet, that learns disentangled image representations from video. Our approach leverages the temporal coherence of video and a novel adversarial loss to learn a representation that factorizes each frame into a stationary part and a temporally varying component. The disentangled representation can be used for a range of tasks. For example, applying a standard LSTM to the time-varying components enables prediction of future frames. We evaluate our approach on a range of synthetic and real videos, demonstrating the ability to coherently generate hundreds of timesteps into the future.
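As a concrete illustration of the factorization described above, here is a minimal PyTorch sketch: a content encoder for the stationary part, a pose encoder for the time-varying part, a decoder that pairs content from one frame with pose from another, and an LSTM over pose codes for future prediction. All module names, layer sizes, and training details are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch of a DrNet-style factorization (layer sizes, names,
# and losses are assumptions, not the paper's exact architecture).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps a 64x64 RGB frame to a flat code vector."""
    def __init__(self, code_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(), # 16 -> 8
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, code_dim),
        )
    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Reconstructs a frame from concatenated [content, pose] codes."""
    def __init__(self, content_dim, pose_dim):
        super().__init__()
        self.fc = nn.Linear(content_dim + pose_dim, 128 * 8 * 8)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )
    def forward(self, content, pose):
        h = self.fc(torch.cat([content, pose], dim=1))
        return self.net(h.view(-1, 128, 8, 8))

content_enc = Encoder(code_dim=128)  # stationary factors
pose_enc = Encoder(code_dim=16)      # time-varying factors
decoder = Decoder(128, 16)

# Temporal coherence: decode frame t+k from the *content* of frame t and the
# *pose* of frame t+k, forcing the content code to carry only what is shared.
frames = torch.rand(8, 5, 3, 64, 64)           # (batch, time, C, H, W)
c_t = content_enc(frames[:, 0])
p_tk = pose_enc(frames[:, 3])
recon_loss = nn.functional.mse_loss(decoder(c_t, p_tk), frames[:, 3])

# Future prediction: run an LSTM over past pose codes (conditioned on the
# fixed content code), then decode each predicted pose with that content.
pose_lstm = nn.LSTM(input_size=16 + 128, hidden_size=256, batch_first=True)
to_pose = nn.Linear(256, 16)
past_poses = torch.stack([pose_enc(frames[:, t]) for t in range(4)], dim=1)
lstm_in = torch.cat([past_poses, c_t.unsqueeze(1).expand(-1, 4, -1)], dim=2)
out, _ = pose_lstm(lstm_in)
next_frame = decoder(c_t, to_pose(out[:, -1]))  # predicted frame t+4
```

The adversarial loss from the abstract (which discourages the pose code from carrying identity information) is omitted here for brevity; in this sketch it would be an additional discriminator applied to pairs of pose codes.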
Remi Denton, Vighnesh Birodkar • 2017
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 3D Human Pose Estimation | MPI-INF-3DHP (test) | -- | -- | 559 |
| Video Prediction | Moving MNIST (test) | MSE | 45.2 | 82 |
| Video Prediction | Coloured dSprites (test) | MSE | 15.2 | 5 |
| Video Prediction | Sprites (test) | MSE | 94.4 | 5 |
| Disentangled Representation Learning | Sprites (test) | Gender Accuracy | 80.5 | 4 |
| Disentangled Representation Learning | Coloured dSprites (test) | Shape Accuracy | 95.7 | 4 |