InstaVSR: Taming Diffusion for Efficient and Temporally Consistent Video Super-Resolution

About

Video super-resolution (VSR) seeks to reconstruct high-resolution frames from low-resolution inputs. While diffusion-based methods have substantially improved perceptual quality, extending them to video remains challenging for two reasons: strong generative priors can introduce temporal instability, and multi-frame diffusion pipelines are often too expensive for practical deployment. To address both challenges simultaneously, we propose InstaVSR, a lightweight diffusion framework for efficient video super-resolution. InstaVSR combines three ingredients: (1) a pruned one-step diffusion backbone that removes several costly components from conventional diffusion-based VSR pipelines, (2) recurrent training with flow-guided temporal regularization to improve frame-to-frame stability, and (3) dual-space adversarial learning in latent and pixel spaces to preserve perceptual quality after backbone simplification. On an NVIDIA RTX 4090, InstaVSR processes a 30-frame video at 2K×2K resolution in under one minute using only 7 GB of memory, substantially reducing computational cost compared to existing diffusion-based methods while maintaining favorable perceptual quality and significantly smoother temporal transitions.
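The flow-guided temporal regularization in ingredient (2) is not spelled out in the abstract, but a common formulation warps the previous super-resolved frame with optical flow and penalizes its difference from the current output. The sketch below illustrates that general idea; the function names, nearest-neighbor warping, and L1 penalty are illustrative assumptions, not InstaVSR's actual implementation.

```python
import numpy as np

def warp_with_flow(frame, flow):
    """Warp a frame (H, W, C) toward the next frame using a dense flow
    field (H, W, 2) with nearest-neighbor sampling.
    Hypothetical helper; real pipelines typically use bilinear sampling."""
    H, W = frame.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)
    return frame[src_y, src_x]

def temporal_consistency_loss(prev_sr, curr_sr, flow, valid_mask=None):
    """L1 distance between the current output and the flow-warped previous
    output; one common form of flow-guided temporal regularization."""
    warped = warp_with_flow(prev_sr, flow)
    diff = np.abs(curr_sr - warped)
    if valid_mask is not None:
        # Mask out occluded or out-of-frame pixels where flow is unreliable.
        diff = diff * valid_mask[..., None]
    return diff.mean()
```

During recurrent training, a loss of this shape would be added to the per-frame reconstruction and adversarial objectives, so that consecutive outputs agree along motion trajectories rather than flickering independently.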

Jintong Hu, Bin Chen, Zhenyu Hu, Jiayue Liu, Guo Wang, Lu Qi • 2026

Related benchmarks

Task                    Dataset                    Result                      Rank
Video Super-Resolution  SPMCS (test)               Avg. PSNR 21.764            45
Video Super-Resolution  YouHQ (test)               PSNR 21.869                 9
Video Super-Resolution  30-frame 2K Video (test)   Inference Time (min) 0.77   8
