RSTT: Real-time Spatial Temporal Transformer for Space-Time Video Super-Resolution

About

Space-time video super-resolution (STVSR) is the task of interpolating videos with both low frame rate (LFR) and low resolution (LR) to produce high-frame-rate (HFR) and high-resolution (HR) counterparts. Existing methods based on convolutional neural networks (CNNs) achieve visually satisfying results but suffer from slow inference speed due to their heavy architectures. We propose to resolve this issue with a spatial-temporal transformer that naturally incorporates the spatial and temporal super-resolution modules into a single model. Unlike CNN-based methods, we do not use separate building blocks for temporal interpolation and spatial super-resolution; instead, we use a single end-to-end transformer architecture. Specifically, the encoders build a reusable dictionary from the input LFR and LR frames, which the decoder then uses to synthesize the HFR and HR frames. Compared with the state-of-the-art TMNet (Xu et al., 2021), our network is about 60% smaller (4.5M vs 12.3M parameters) and about 80% faster (26.2 fps vs 14.3 fps on 720×576 frames) without sacrificing much performance. The source code is available at https://github.com/llmpass/RSTT.
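The size and speed comparison can be checked directly from the figures quoted in the abstract. The following is a minimal sketch; the helper names `relative_reduction` and `relative_speedup` are illustrative and are not taken from the RSTT code base.

```python
def relative_reduction(ours: float, baseline: float) -> float:
    """Fraction by which `ours` is smaller than `baseline`."""
    return 1.0 - ours / baseline


def relative_speedup(ours_fps: float, baseline_fps: float) -> float:
    """Fraction by which `ours_fps` exceeds `baseline_fps`."""
    return ours_fps / baseline_fps - 1.0


# Figures quoted in the abstract (RSTT vs TMNet).
size_reduction = relative_reduction(4.5, 12.3)  # parameters, in millions
speed_gain = relative_speedup(26.2, 14.3)       # frames per second

print(f"smaller by {size_reduction:.1%}")
print(f"faster by {speed_gain:.1%}")
```

Both ratios come out slightly above the rounded 60% and 80% stated in the abstract.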

Zhicheng Geng, Luming Liang, Tianyu Ding, Ilya Zharkov • 2022

Related benchmarks

Task	Dataset	Result
Space-Time Video Super-Resolution	Vid4 (test)	PSNR (dB) 26.43	46
Space-Time Video Super-Resolution	Vid4	PSNR (dB) 26.43	41
Video Super-Resolution	Vimeo-90K Medium (test)	PSNR (dB) 35.66	39
Video Super-Resolution	Vimeo-90K Slow (test)	PSNR (dB) 33.5	39
Video Super-Resolution	Vimeo-90K Fast (test)	PSNR (dB) 36.8	39
Video Super-Resolution	Vimeo-90k Fast	PSNR (dB) 36.8	35
Video Super-Resolution	Vimeo-90k Slow	PSNR (dB) 33.5	30
Video Super-Resolution	Vimeo-90k Medium	PSNR (dB) 35.66	30
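
The results above are reported as PSNR. As a reminder of what that number measures, here is a minimal sketch of the standard definition, PSNR = 10 · log10(peak² / MSE), assuming 8-bit frames (peak value 255). The evaluation protocol behind these rows (for example, which colour channel is scored) is not stated here, so the function `psnr` below is illustrative only and is not the benchmark's evaluation script.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """PSNR in dB between two equally sized flat sequences of pixel values."""
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(peak ** 2 / mse)


# Toy example: every pixel of the estimate is off by one grey level.
reference = [10, 20, 30, 40]
estimate = [11, 21, 31, 41]
print(f"{psnr(reference, estimate):.2f} dB")
```

Higher values mean the reconstructed frame is closer to the ground truth; because the scale is logarithmic, a gain of a few tenths of a dB already reflects a measurable drop in mean squared error.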
