
INR-V: A Continuous Representation Space for Video-based Generative Tasks

About

Generating videos is a complex task that is accomplished by generating a set of temporally coherent images frame by frame. This limits the expressivity of videos to image-based operations on individual frames and requires network designs that obtain temporally coherent trajectories in the underlying image space. We propose INR-V, a video representation network that learns a continuous space for video-based generative tasks. INR-V parameterizes each video as an implicit neural representation (INR): a multi-layer perceptron that predicts an RGB value for each input pixel location of the video. The INR is predicted by a meta-network, a hypernetwork trained on neural representations of multiple video instances. The meta-network can later be sampled to generate diverse novel videos, enabling many downstream video-based generative tasks. Interestingly, we find that conditional regularization and progressive weight initialization play a crucial role in obtaining INR-V. The representation space learned by INR-V is more expressive than an image space and exhibits many interesting properties not possible with existing works. For instance, INR-V can smoothly interpolate intermediate videos between known video instances (such as intermediate identities, expressions, and poses in face videos). It can also inpaint missing portions of videos to recover temporally coherent full videos. In this work, we evaluate the space learned by INR-V against existing baselines on diverse generative tasks: video interpolation, novel video generation, video inversion, and video inpainting. INR-V significantly outperforms the baselines on several of these tasks, showcasing the potential of the proposed representation space.
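To make the abstract's description concrete, here is a minimal, untrained sketch of the idea: a hypernetwork maps a per-video latent code to the weights of a small MLP, and that MLP (the INR) maps a pixel location and time (x, y, t) to an RGB value. It is not the authors' implementation. All names, sizes, the linear hypernetwork, and the sine activation are assumptions made for illustration, and the training procedure (including conditional regularization and progressive weight initialization) is omitted.

```python
import math
import random

# All sizes below are illustrative assumptions, not values from the paper.
LATENT_DIM = 8            # size of the per-video latent code
HIDDEN = 16               # hidden width of the INR
IN_DIM, OUT_DIM = 3, 3    # (x, y, t) -> (r, g, b)

# Two linear layers with biases, stored neuron by neuron.
N_PARAMS = HIDDEN * (IN_DIM + 1) + OUT_DIM * (HIDDEN + 1)


def make_hypernetwork(seed=0):
    """Random (untrained) linear hypernetwork: latent code -> INR parameters."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.3) for _ in range(LATENT_DIM)]
            for _ in range(N_PARAMS)]


def predict_inr_params(hyper, z):
    """Map a latent code z to the flat parameter vector of one video's INR."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in hyper]


def inr_forward(params, x, y, t):
    """Evaluate the INR (a 2-layer MLP) at one normalized (x, y, t) location."""
    coords = (x, y, t)
    i = 0
    hidden = []
    for _ in range(HIDDEN):
        w, b = params[i:i + IN_DIM], params[i + IN_DIM]
        i += IN_DIM + 1
        # Sine activation is an assumption chosen for this sketch.
        hidden.append(math.sin(sum(wi * ci for wi, ci in zip(w, coords)) + b))
    rgb = []
    for _ in range(OUT_DIM):
        w, b = params[i:i + HIDDEN], params[i + HIDDEN]
        i += HIDDEN + 1
        pre = sum(wi * hi for wi, hi in zip(w, hidden)) + b
        rgb.append(1.0 / (1.0 + math.exp(-pre)))  # sigmoid maps to [0, 1]
    return tuple(rgb)


def render_video(params, width, height, frames):
    """Query the INR on a regular grid, indexed as video[t][y][x]."""
    def norm(k, n):
        return k / (n - 1) if n > 1 else 0.0
    return [[[inr_forward(params, norm(x, width), norm(y, height), norm(t, frames))
              for x in range(width)]
             for y in range(height)]
            for t in range(frames)]


def interpolate(z_a, z_b, alpha):
    """Linear interpolation between two latent codes."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)]


# Usage: sample two latent codes and render the video halfway between them.
rng = random.Random(1)
hyper = make_hypernetwork()
z_a = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
z_b = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
z_mid = interpolate(z_a, z_b, 0.5)
video = render_video(predict_inr_params(hyper, z_mid), width=4, height=4, frames=2)
```

Because the whole video is a function of continuous coordinates, the same predicted parameters can be rendered at any resolution or frame count, and operations such as interpolation act on the latent code of the entire video rather than on individual frames.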

Bipasha Sen, Aditya Agarwal, Vinay P Namboodiri, C. V. Jawahar • 2022

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Generation | SkyTimelapse (test) | FVD16: 153.4 | 16 |
| Video Generation | How2Sign Faces (test) | FVD16: 87.22 | 5 |
| Video Generation | Moving MNIST (test) | FVD (16 frames): 47.28 | 5 |
| Video Generation | RainbowJelly (test) | FVD16: 260.7 | 4 |
| Video Interpolation | How2Sign-Faces and SkyTimelapse | User Preference: 100 | 3 |
| Video Inversion | How2Sign Faces (test) | GT-ID: 0.77 | 3 |
| Superresolution | How2Sign Faces (test) | GT Identity Score: 0.734 | 3 |
| Video Superresolution | RainbowJelly 200 x 200 2x | PSNR: 28.62 | 3 |
| Video Superresolution | RainbowJelly 3x (300 x 300) 2048 videos randomly sampled | PSNR: 29.17 | 3 |
| Frame Interpolation | How2Sign Faces (test) | GT ID: 0.702 | 2 |
Showing 10 of 14 rows
