Scalable Neural Video Representations with Learnable Positional Features
About
Coordinate-based neural representations (CNRs) have made great progress in representing complex signals succinctly, and several recent efforts extend them to videos. The main challenge here is to (a) alleviate the compute inefficiency of training CNRs while (b) achieving high-quality video encoding and (c) maintaining parameter efficiency. To meet requirements (a), (b), and (c) simultaneously, we propose neural video representations with learnable positional features (NVP), a novel CNR that introduces "learnable positional features" to effectively amortize a video as latent codes. Specifically, we first present a CNR architecture built on 2D latent keyframes that learn the common video contents across each spatio-temporal axis, which dramatically improves all three requirements. Then, we propose to use existing powerful image and video codecs as a compute- and memory-efficient compression procedure for the latent codes. We demonstrate the superiority of NVP on the popular UVG benchmark: compared with prior art, NVP not only trains 2 times faster (in less than 5 minutes) but also improves encoding quality from 34.07 to 34.57 (measured in PSNR), while using more than 8 times fewer parameters. We also show intriguing properties of NVP, such as video inpainting and video frame interpolation.
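As a rough illustration of what "2D latent keyframes across each spatio-temporal axis" could look like, the sketch below stores one small learnable grid per axis pair (xy, xt, yt) and reads a feature vector for a query coordinate by bilinear interpolation. It is a minimal sketch under stated assumptions, not the NVP implementation: the grid resolutions, the channel count, the concatenation of the three lookups, and the names `bilinear_lookup` and `keyframe_features` are all hypothetical, and the network that would map the resulting features to RGB is omitted.

```python
import numpy as np

def bilinear_lookup(grid, u, v):
    """Bilinearly interpolate a (R0, R1, C) grid at normalized coords u, v in [0, 1]."""
    r0, r1, _ = grid.shape
    # Map normalized coordinates to continuous grid indices.
    fu, fv = u * (r0 - 1), v * (r1 - 1)
    i0, j0 = int(np.floor(fu)), int(np.floor(fv))
    # Clamp the upper neighbours so coordinates on the border stay in range.
    i1, j1 = min(i0 + 1, r0 - 1), min(j0 + 1, r1 - 1)
    a, b = fu - i0, fv - j0
    return ((1 - a) * (1 - b) * grid[i0, j0]
            + (1 - a) * b * grid[i0, j1]
            + a * (1 - b) * grid[i1, j0]
            + a * b * grid[i1, j1])

def keyframe_features(keyframes, x, y, t):
    """Concatenate features read from the xy, xt and yt keyframes (assumed layout)."""
    return np.concatenate([
        bilinear_lookup(keyframes["xy"], x, y),
        bilinear_lookup(keyframes["xt"], x, t),
        bilinear_lookup(keyframes["yt"], y, t),
    ])

# Hypothetical sizes: 8x8 spatial grid, 6 temporal steps, 4 channels per keyframe.
rng = np.random.default_rng(0)
channels = 4
keyframes = {
    "xy": rng.normal(size=(8, 8, channels)),
    "xt": rng.normal(size=(8, 6, channels)),
    "yt": rng.normal(size=(8, 6, channels)),
}

# Feature vector for the pixel at (x, y) = (0.25, 0.5) in the last frame (t = 1).
feat = keyframe_features(keyframes, 0.25, 0.5, 1.0)
```

In a trainable version the three grids would be parameters updated by gradient descent together with the decoder; plain NumPy arrays are used here only to keep the lookup itself easy to follow.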
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Compression | UVG standard (full) | Beauty Quality Score | 34.41 | 24 |
| Video Encoding | UVG-HD | PSNR | 37.61 | 19 |
| Implicit Video Representation | UVG-HD full 1920x1080 | PSNR (Beauty) | 34.41 | 18 |
| Video Compression | Big Buck Bunny | BPP | 0.136 | 6 |
| Video Decoding | UVG Jockey 1920x1080 (600 frames) | BPP | 0.172 | 5 |
| Video Frame Interpolation | Big Buck Bunny | PSNR | 33.76 | 2 |
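
For readers unfamiliar with the two metrics in the table above, the following sketch shows how PSNR (in dB) and bits per pixel (BPP) are commonly computed. It assumes pixel values normalized to [0, 1] and a single PSNR over the whole clip; benchmark protocols may instead average per-frame PSNR, so this is illustrative and is not the evaluation code behind the listed numbers. The function names are hypothetical.

```python
import numpy as np

def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two arrays of equal shape."""
    diff = reference.astype(np.float64) - reconstruction.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical signals
    return 10.0 * np.log10(max_value ** 2 / mse)

def bits_per_pixel(num_bytes, num_frames, height, width):
    """Size of the compressed representation divided by the total pixel count."""
    return 8.0 * num_bytes / (num_frames * height * width)

# Toy clip: 10 frames of 100x100 RGB, reconstruction off by 0.1 everywhere.
reference = np.zeros((10, 100, 100, 3))
reconstruction = reference + 0.1
quality_db = psnr(reference, reconstruction)      # MSE = 0.01, so about 20 dB
rate_bpp = bits_per_pixel(12_500, 10, 100, 100)   # 100,000 bits over 100,000 pixels
```

A higher PSNR means a more faithful reconstruction, while a lower BPP means a smaller representation, so the two are usually reported together.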