INR-V: A Continuous Representation Space for Video-based Generative Tasks
About
Generating videos is a complex task that is accomplished by generating a set of temporally coherent images frame by frame. This restricts the expressivity of videos to image-based operations on individual frames and requires network designs that obtain temporally coherent trajectories in the underlying image space. We propose INR-V, a video representation network that learns a continuous space for video-based generative tasks. INR-V parameterizes each video as an implicit neural representation (INR), a multi-layer perceptron that predicts an RGB value for each input pixel location of the video. The INR is predicted by a meta-network, a hypernetwork trained on neural representations of multiple video instances. Once trained, the meta-network can be sampled to generate diverse novel videos, enabling many downstream video-based generative tasks. Interestingly, we find that conditional regularization and progressive weight initialization play a crucial role in obtaining INR-V. The representation space learned by INR-V is more expressive than an image space and exhibits properties not possible with existing works. For instance, INR-V can smoothly interpolate intermediate videos between known video instances (such as intermediate identities, expressions, and poses in face videos). It can also inpaint missing portions of videos to recover temporally coherent full videos. In this work, we evaluate the space learned by INR-V against existing baselines on diverse generative tasks: video interpolation, novel video generation, video inversion, and video inpainting. INR-V significantly outperforms the baselines on several of these tasks, showcasing the potential of the proposed representation space.
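The core idea in the abstract (a hypernetwork maps a latent code to the weights of a small MLP, and that MLP maps a pixel location to an RGB value) can be illustrated with a minimal sketch. This is not the authors' implementation: the layer sizes, the sine activation, the single linear hypernetwork layer, and every name below are assumptions chosen for illustration, and the weights are random rather than trained. Conditional regularization and progressive weight initialization are not reproduced here.

```python
import math
import random

# Hypothetical sizes, chosen only to keep the sketch small.
IN_DIM, HIDDEN, OUT_DIM = 3, 8, 3      # (x, y, t) -> hidden -> (r, g, b)
LATENT_DIM = 4                          # size of the per-video latent code
N_PARAMS = IN_DIM * HIDDEN + HIDDEN + HIDDEN * OUT_DIM + OUT_DIM


def make_hypernetwork(seed=0):
    """Random (untrained) linear hypernetwork: one row per INR parameter."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.5) for _ in range(LATENT_DIM)]
            for _ in range(N_PARAMS)]


def predict_inr_params(hyper, z):
    """Map a latent code z to the flat parameter vector of one video's INR."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in hyper]


def inr_forward(params, x, y, t):
    """Evaluate the INR at pixel (x, y) of frame time t; returns RGB in (0, 1)."""
    i = 0
    w1 = [params[i + h * IN_DIM: i + (h + 1) * IN_DIM] for h in range(HIDDEN)]
    i += IN_DIM * HIDDEN
    b1 = params[i: i + HIDDEN]
    i += HIDDEN
    w2 = [params[i + o * HIDDEN: i + (o + 1) * HIDDEN] for o in range(OUT_DIM)]
    i += HIDDEN * OUT_DIM
    b2 = params[i: i + OUT_DIM]

    coords = (x, y, t)
    # Sine activation is an assumption; the abstract only specifies an MLP.
    hidden = [math.sin(sum(w * c for w, c in zip(row, coords)) + b)
              for row, b in zip(w1, b1)]
    # Sigmoid squashes each output channel into (0, 1).
    return tuple(1.0 / (1.0 + math.exp(-(sum(w * h for w, h in zip(row, hidden)) + b)))
                 for row, b in zip(w2, b2))


def render_video(params, frames=4, height=4, width=4):
    """Query the INR on a regular grid of normalized (x, y, t) coordinates."""
    return [[[inr_forward(params,
                          col / max(width - 1, 1),
                          row / max(height - 1, 1),
                          f / max(frames - 1, 1))
              for col in range(width)]
             for row in range(height)]
            for f in range(frames)]


def lerp(z_a, z_b, alpha):
    """Linear interpolation between two latent codes."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)]


hyper = make_hypernetwork(seed=0)
z_a = [1.0, 0.0, 0.0, 0.0]
z_b = [0.0, 1.0, 0.0, 0.0]
# An "intermediate video": decode the midpoint of two latent codes.
video_mid = render_video(predict_inr_params(hyper, lerp(z_a, z_b, 0.5)))
```

Because the video is a function of continuous coordinates, the same parameters can be queried on a denser grid (for example, more frames or a larger `height` and `width`), which is the property that the superresolution and frame-interpolation tasks rely on.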
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Generation | SkyTimelapse (test) | FVD16 | 153.4 | 16 |
| Video Generation | How2Sign Faces (test) | FVD16 | 87.22 | 5 |
| Video Generation | Moving MNIST (test) | FVD (16 frames) | 47.28 | 5 |
| Video Generation | RainbowJelly (test) | FVD16 | 260.7 | 4 |
| Video Interpolation | How2Sign-Faces and SkyTimelapse | User Preference | 100 | 3 |
| Video Inversion | How2Sign Faces (test) | GT-ID | 0.77 | 3 |
| Superresolution | How2Sign Faces (test) | GT Identity Score | 0.734 | 3 |
| Video Superresolution | RainbowJelly 200 x 200 2x | PSNR | 28.62 | 3 |
| Video Superresolution | RainbowJelly 3x (300 x 300) 2048 videos randomly sampled | PSNR | 29.17 | 3 |
| Frame Interpolation | How2Sign Faces (test) | GT ID | 0.702 | 2 |