Video Frame Interpolation Transformer
## About
Existing methods for video interpolation rely heavily on deep convolutional neural networks, and thus suffer from their intrinsic limitations, such as content-agnostic kernel weights and a restricted receptive field. To address these issues, we propose a Transformer-based video interpolation framework that allows content-aware aggregation weights and captures long-range dependencies with self-attention operations. To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video interpolation and extend it to the spatial-temporal domain. Furthermore, we propose a space-time separation strategy that saves memory and also improves performance. In addition, we develop a multi-scale frame synthesis scheme to fully realize the potential of Transformers. Extensive experiments demonstrate that the proposed model performs favorably against state-of-the-art methods, both quantitatively and qualitatively, on a variety of benchmark datasets.
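The space-time separation idea above can be illustrated with a minimal sketch: instead of one joint attention over all T×N spatio-temporal tokens, attention is applied locally along the spatial axis first, then along the temporal axis. This is an illustrative NumPy toy (single-head, unprojected queries/keys, window sizes chosen arbitrarily), not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_attention_1d(x, window):
    """Each token in x (shape L x C) attends to a local window along axis 0."""
    L, C = x.shape
    half = window // 2
    out = np.zeros_like(x)
    for i in range(L):
        lo, hi = max(0, i - half), min(L, i + half + 1)
        k = x[lo:hi]                            # local keys/values: (w, C)
        w = softmax(x[i] @ k.T / np.sqrt(C))    # content-aware weights: (w,)
        out[i] = w @ k                          # aggregate within the window
    return out

def separated_st_attention(x, spatial_window=3, temporal_window=3):
    """Space-time separated local attention on x of shape (T, N, C):
    attend over the N spatial tokens of each frame, then over the T
    frames at each spatial location, instead of joint (T*N) attention."""
    T, N, C = x.shape
    y = np.stack([local_attention_1d(x[t], spatial_window) for t in range(T)])
    return np.stack([local_attention_1d(y[:, n], temporal_window)
                     for n in range(N)], axis=1)

feats = np.random.default_rng(0).normal(size=(4, 8, 16))  # 4 frames, 8 tokens
out = separated_st_attention(feats)
```

The separation reduces the attention cost from one pass over T·N tokens jointly to two passes over N and T tokens respectively, which is the memory saving the abstract refers to.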
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Frame Interpolation | UCF101 | PSNR | 33.44 | 117 |
| Video Frame Interpolation | DAVIS | PSNR | 28.09 | 33 |
| Video Frame Interpolation | Vimeo-90K septuplet | PSNR | 36.96 | 20 |
| Video Frame Interpolation | Vimeo-90k | PSNR | 36.963 | 18 |
| Video Frame Interpolation | UCF101 | PSNR | 33.837 | 12 |
| Video Frame Interpolation | GDM | PSNR | 30.217 | 12 |
| Video Interpolation | Vimeo-90K septuplet (test) | Run-time | 0.08 | 5 |