Video Frame Interpolation Transformer
## About
Existing methods for video interpolation rely heavily on deep convolutional neural networks, and thus suffer from their intrinsic limitations, such as content-agnostic kernel weights and a restricted receptive field. To address these issues, we propose a Transformer-based video interpolation framework that allows content-aware aggregation weights and captures long-range dependencies with self-attention operations. To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video interpolation and extend it to the spatial-temporal domain. Furthermore, we propose a space-time separation strategy that saves memory and also improves performance. In addition, we develop a multi-scale frame synthesis scheme to fully realize the potential of Transformers. Extensive experiments demonstrate that the proposed model performs favorably against state-of-the-art methods, both quantitatively and qualitatively, on a variety of benchmark datasets.
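The space-time separation idea above can be illustrated with a minimal sketch: instead of one joint attention over all T×N spatio-temporal tokens, attention is applied locally along the spatial axis first, then along the temporal axis. This is an illustrative NumPy toy (single-head, unprojected queries/keys, window sizes chosen arbitrarily), not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_attention_1d(x, window):
    """Each token in x (shape L x C) attends to a local window along axis 0."""
    L, C = x.shape
    half = window // 2
    out = np.zeros_like(x)
    for i in range(L):
        lo, hi = max(0, i - half), min(L, i + half + 1)
        k = x[lo:hi]                            # local keys/values: (w, C)
        w = softmax(x[i] @ k.T / np.sqrt(C))    # content-aware weights: (w,)
        out[i] = w @ k                          # aggregate within the window
    return out

def separated_st_attention(x, spatial_window=3, temporal_window=3):
    """Space-time separated local attention on x of shape (T, N, C):
    attend over the N spatial tokens of each frame, then over the T
    frames at each spatial location, instead of joint (T*N) attention."""
    T, N, C = x.shape
    y = np.stack([local_attention_1d(x[t], spatial_window) for t in range(T)])
    return np.stack([local_attention_1d(y[:, n], temporal_window)
                     for n in range(N)], axis=1)

feats = np.random.default_rng(0).normal(size=(4, 8, 16))  # 4 frames, 8 tokens
out = separated_st_attention(feats)
```

The separation reduces the attention cost from one pass over T·N tokens jointly to two passes over N and T tokens respectively, which is the memory saving the abstract refers to.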
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Frame Interpolation | UCF101 | PSNR | 33.44 | 117 |
| Video Frame Interpolation | DAVIS | PSNR | 28.09 | 33 |
| Video Frame Interpolation | Vimeo-90K septuplet | PSNR | 36.96 | 20 |
| Video Frame Interpolation | Vimeo-90k | PSNR | 36.963 | 18 |
| Video Frame Interpolation | UCF101 | PSNR | 33.837 | 12 |
| Video Frame Interpolation | GDM | PSNR | 30.217 | 12 |
| Video Interpolation | Vimeo-90K septuplet (test) | Run-time | 0.08 | 5 |