Unsupervised Learning of Disentangled Representations from Video
About
We present a new model, DrNet, that learns disentangled image representations from video. Our approach leverages the temporal coherence of video and a novel adversarial loss to learn a representation that factorizes each frame into a stationary part and a temporally varying component. The disentangled representation can be used for a range of tasks. For example, applying a standard LSTM to the time-varying components enables prediction of future frames. We evaluate our approach on a range of synthetic and real videos, demonstrating the ability to coherently generate hundreds of timesteps into the future.
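As a concrete illustration of the factorization described above, here is a minimal PyTorch sketch: a content encoder for the stationary part, a pose encoder for the time-varying part, a decoder that pairs content from one frame with pose from another, and an LSTM over pose codes for future prediction. All module names, layer sizes, and training details are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch of a DrNet-style factorization (layer sizes, names,
# and losses are assumptions, not the paper's exact architecture).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps a 64x64 RGB frame to a flat code vector."""
    def __init__(self, code_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(), # 16 -> 8
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, code_dim),
        )
    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Reconstructs a frame from concatenated [content, pose] codes."""
    def __init__(self, content_dim, pose_dim):
        super().__init__()
        self.fc = nn.Linear(content_dim + pose_dim, 128 * 8 * 8)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )
    def forward(self, content, pose):
        h = self.fc(torch.cat([content, pose], dim=1))
        return self.net(h.view(-1, 128, 8, 8))

content_enc = Encoder(code_dim=128)  # stationary factors
pose_enc = Encoder(code_dim=16)      # time-varying factors
decoder = Decoder(128, 16)

# Temporal coherence: decode frame t+k from the *content* of frame t and the
# *pose* of frame t+k, forcing the content code to carry only what is shared.
frames = torch.rand(8, 5, 3, 64, 64)           # (batch, time, C, H, W)
c_t = content_enc(frames[:, 0])
p_tk = pose_enc(frames[:, 3])
recon_loss = nn.functional.mse_loss(decoder(c_t, p_tk), frames[:, 3])

# Future prediction: run an LSTM over past pose codes (conditioned on the
# fixed content code), then decode each predicted pose with that content.
pose_lstm = nn.LSTM(input_size=16 + 128, hidden_size=256, batch_first=True)
to_pose = nn.Linear(256, 16)
past_poses = torch.stack([pose_enc(frames[:, t]) for t in range(4)], dim=1)
lstm_in = torch.cat([past_poses, c_t.unsqueeze(1).expand(-1, 4, -1)], dim=2)
out, _ = pose_lstm(lstm_in)
next_frame = decoder(c_t, to_pose(out[:, -1]))  # predicted frame t+4
```

The adversarial loss from the abstract (which discourages the pose code from carrying identity information) is omitted here for brevity; in this sketch it would be an additional discriminator applied to pairs of pose codes.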
Remi Denton, Vighnesh Birodkar • 2017
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 3D Human Pose Estimation | MPI-INF-3DHP (test) | -- | -- | 559 |
| Video Prediction | Moving MNIST (test) | MSE | 45.2 | 82 |
| Video Prediction | Coloured dSprites (test) | MSE | 15.2 | 5 |
| Video Prediction | Sprites (test) | MSE | 94.4 | 5 |
| Disentangled Representation Learning | Sprites (test) | Gender Accuracy | 80.5 | 4 |
| Disentangled Representation Learning | Coloured dSprites (test) | Shape Accuracy | 95.7 | 4 |