Scalable Neural Video Representations with Learnable Positional Features

About

Succinct representation of complex signals with coordinate-based neural representations (CNRs) has seen great progress, and several recent efforts focus on extending CNRs to videos. Here, the main challenge is how to (a) alleviate the compute inefficiency of training CNRs while (b) achieving high-quality video encoding and (c) maintaining parameter efficiency. To meet requirements (a), (b), and (c) simultaneously, we propose neural video representations with learnable positional features (NVP), a novel CNR that introduces "learnable positional features" to effectively amortize a video as latent codes. Specifically, we first present a CNR architecture built on 2D latent keyframes that learn the common video contents across each spatio-temporal axis, which dramatically improves all three requirements. Then, we propose to use existing powerful image and video codecs as a compute- and memory-efficient compression procedure for the latent codes. We demonstrate the superiority of NVP on the popular UVG benchmark: compared with prior art, NVP not only trains 2 times faster (in less than 5 minutes) but also improves encoding quality from 34.07 to 34.57 (measured with the PSNR metric), even while using more than 8 times fewer parameters. We also show intriguing properties of NVP, such as video inpainting and video frame interpolation.
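To make the "2D latent keyframes" idea concrete, here is a minimal sketch of how a positional feature for a video coordinate (x, y, t) could be looked up from three small 2D grids. It assumes one keyframe per pair of axes (xy, xt, yt), bilinear interpolation, and simple concatenation; the function names, grid sizes, and random initialization are hypothetical and not taken from the paper, and the sketch leaves out the training loop and the network that maps features to RGB.

```python
import random


def make_grid(rows, cols, dim, rng):
    """Create a rows x cols grid of dim-dimensional latent vectors (the learnable parameters)."""
    return [[[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(cols)]
            for _ in range(rows)]


def bilinear(grid, u, v):
    """Bilinearly interpolate a 2D grid of feature vectors at (u, v), both in [0, 1].

    The grid must have at least 2 rows and 2 columns.
    """
    rows, cols = len(grid), len(grid[0])
    x, y = u * (rows - 1), v * (cols - 1)
    # Clamp the lower cell index so that u == 1 or v == 1 stays inside the grid.
    i0, j0 = min(int(x), rows - 2), min(int(y), cols - 2)
    fx, fy = x - i0, y - j0
    dim = len(grid[0][0])
    return [
        (1 - fx) * (1 - fy) * grid[i0][j0][k]
        + fx * (1 - fy) * grid[i0 + 1][j0][k]
        + (1 - fx) * fy * grid[i0][j0 + 1][k]
        + fx * fy * grid[i0 + 1][j0 + 1][k]
        for k in range(dim)
    ]


def keyframe_features(keyframes, x, y, t):
    """Concatenate the features read from the xy, xt, and yt keyframes (an assumed layout)."""
    return (bilinear(keyframes["xy"], x, y)
            + bilinear(keyframes["xt"], x, t)
            + bilinear(keyframes["yt"], y, t))


rng = random.Random(0)
keyframes = {
    "xy": make_grid(8, 8, 4, rng),  # shared content over the two spatial axes
    "xt": make_grid(8, 6, 4, rng),  # shared content over x and time
    "yt": make_grid(8, 6, 4, rng),  # shared content over y and time
}
features = keyframe_features(keyframes, x=0.25, y=0.5, t=0.9)
print(len(features))
```

In this layout the number of stored latent vectors grows with the area of three 2D grids rather than with the volume of a full 3D grid, which is one way to read the parameter-efficiency claim above.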

Subin Kim, Sihyun Yu, Jaeho Lee, Jinwoo Shin • 2022

Related benchmarks

Task	Dataset	Metric	Value	
Video Compression	UVG standard (full)	Beauty Quality Score	34.41	24
Video Encoding	UVG-HD	PSNR	37.61	19
Implicit Video Representation	UVG-HD full 1920x1080	PSNR (Beauty)	34.41	18
Video Compression	Big Buck Bunny	BPP	0.136	6
Video Decoding	UVG Jockey 1920x1080 (600 frames)	BPP	0.172	5
Video Frame Interpolation	Big Buck Bunny	PSNR	33.76	2
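
Several of the results above are reported as PSNR (peak signal-to-noise ratio). As a reminder of what that number measures, the sketch below computes PSNR from the mean squared error between a reference signal and its reconstruction, using the standard definition 10 * log10(MAX^2 / MSE). The flat-list input and the default `max_value=255.0` (8-bit pixels) are assumptions for illustration; this is not the benchmark's evaluation script.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Return PSNR in dB between two equal-length sequences of pixel values."""
    if not reference or len(reference) != len(reconstruction):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 4-pixel example: every pixel is off by 2 intensity levels.
print(round(psnr([10, 20, 30, 40], [12, 18, 32, 38]), 2))
```

Because the scale is logarithmic, a gain of 0.5 dB at a fixed peak value corresponds to roughly an 11% reduction in mean squared error.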
