InstaVSR: Taming Diffusion for Efficient and Temporally Consistent Video Super-Resolution

About

Video super-resolution (VSR) seeks to reconstruct high-resolution frames from low-resolution inputs. While diffusion-based methods have substantially improved perceptual quality, extending them to video remains challenging for two reasons: strong generative priors can introduce temporal instability, and multi-frame diffusion pipelines are often too expensive for practical deployment. To address both challenges simultaneously, we propose InstaVSR, a lightweight diffusion framework for efficient video super-resolution. InstaVSR combines three ingredients: (1) a pruned one-step diffusion backbone that removes several costly components from conventional diffusion-based VSR pipelines, (2) recurrent training with flow-guided temporal regularization to improve frame-to-frame stability, and (3) dual-space adversarial learning in latent and pixel spaces to preserve perceptual quality after backbone simplification. On an NVIDIA RTX 4090, InstaVSR processes a 30-frame video at 2K×2K resolution in under one minute using only 7 GB of memory, substantially reducing computational cost compared to existing diffusion-based methods while maintaining favorable perceptual quality and significantly smoother temporal transitions.
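The flow-guided temporal regularization in ingredient (2) is not spelled out in the abstract, but a common formulation warps the previous super-resolved frame with optical flow and penalizes its difference from the current output. The sketch below illustrates that general idea; the function names, nearest-neighbor warping, and L1 penalty are illustrative assumptions, not InstaVSR's actual implementation.

```python
import numpy as np

def warp_with_flow(frame, flow):
    """Warp a frame (H, W, C) toward the next frame using a dense flow
    field (H, W, 2) with nearest-neighbor sampling.
    Hypothetical helper; real pipelines typically use bilinear sampling."""
    H, W = frame.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)
    return frame[src_y, src_x]

def temporal_consistency_loss(prev_sr, curr_sr, flow, valid_mask=None):
    """L1 distance between the current output and the flow-warped previous
    output; one common form of flow-guided temporal regularization."""
    warped = warp_with_flow(prev_sr, flow)
    diff = np.abs(curr_sr - warped)
    if valid_mask is not None:
        # Mask out occluded or out-of-frame pixels where flow is unreliable.
        diff = diff * valid_mask[..., None]
    return diff.mean()
```

During recurrent training, a loss of this shape would be added to the per-frame reconstruction and adversarial objectives, so that consecutive outputs agree along motion trajectories rather than flickering independently.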

Jintong Hu, Bin Chen, Zhenyu Hu, Jiayue Liu, Guo Wang, Lu Qi • 2026

Related benchmarks

Task                    Dataset                    Result                      Rank
Video Super-Resolution  SPMCS (test)               Avg. PSNR 21.764            45
Video Super-Resolution  YouHQ (test)               PSNR 21.869                 9
Video Super-Resolution  30-frame 2K Video (test)   Inference Time (min) 0.77   8
