
InfVSR: Toward Consistency-Driven Streaming Generative Video Super-Resolution

About

Real-world videos often extend over thousands of frames. Existing generative video super-resolution (VSR) approaches, however, face two persistent challenges when processing long sequences: (1) inefficiency, due to the heavy cost of multi-step denoising over full-length sequences; and (2) poor consistency, caused by temporal decomposition that introduces artifacts and discontinuities. To break these limits, we propose InfVSR, which reformulates VSR as an autoregressive one-step diffusion paradigm and enables streaming inference with video diffusion priors. First, we adapt the pretrained DiT into a causal structure, maintaining both local and global coherence via a rolling KV-cache and joint visual guidance. Second, we efficiently distill the diffusion process into a single step, using patch-wise pixel supervision and cross-chunk distribution matching. To fill the gap in long-form video evaluation, we build a new benchmark tailored for extended sequences and introduce semantic-level metrics to comprehensively assess temporal consistency. Our method pushes the frontier of long-form VSR, achieves state-of-the-art quality with enhanced semantic consistency, and delivers up to a 58x speed-up over existing methods such as MGLD-VSR. Our code and models are available at https://github.com/Kai-Liu001/InfVSR.
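The streaming idea in the abstract (process the video chunk by chunk while carrying a bounded amount of context forward) can be illustrated with a small sketch. This is not the authors' implementation: `stream_upscale`, `toy_step`, and the scalar "summary" that stands in for cached keys and values are all hypothetical names and simplifications chosen for illustration, and the real model would be a causal DiT operating on tensors.

```python
from collections import deque
from typing import Callable, Deque, Iterator, List, Sequence, Tuple

Frame = List[float]   # stand-in for an image tensor (assumption)
Chunk = List[Frame]
# A step takes the current chunk plus cached context and returns
# (restored chunk, new cache entry). Hypothetical interface.
Step = Callable[[Chunk, List[float]], Tuple[Chunk, float]]


def iter_chunks(frames: Sequence[Frame], chunk_size: int) -> Iterator[Chunk]:
    """Split a frame sequence into consecutive, non-overlapping chunks."""
    for start in range(0, len(frames), chunk_size):
        yield list(frames[start:start + chunk_size])


def stream_upscale(frames: Sequence[Frame], chunk_size: int,
                   cache_len: int, step: Step) -> List[Frame]:
    """Process chunks in order, keeping only the newest cache_len entries."""
    # deque(maxlen=...) drops the oldest entry automatically: a rolling cache.
    cache: Deque[float] = deque(maxlen=cache_len)
    outputs: List[Frame] = []
    for chunk in iter_chunks(frames, chunk_size):
        restored, summary = step(chunk, list(cache))
        outputs.extend(restored)
        cache.append(summary)
    return outputs


def toy_step(chunk: Chunk, context: List[float]) -> Tuple[Chunk, float]:
    """Placeholder for a one-step model: blends each value with the context mean."""
    chunk_mean = sum(sum(f) for f in chunk) / sum(len(f) for f in chunk)
    ctx_mean = sum(context) / len(context) if context else chunk_mean
    restored = [[0.5 * v + 0.5 * ctx_mean for v in f] for f in chunk]
    return restored, chunk_mean
```

Because the cache has a fixed maximum length, memory use in this sketch stays constant regardless of how many frames are processed, which is the property that makes streaming over arbitrarily long inputs possible.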

Ziqing Zhang, Kai Liu, Zheng Chen, Xi Li, Yucong Chen, Bingnan Duan, Linghe Kong, Yulun Zhang • 2025

Related benchmarks

Task                     Dataset                     Result                     Rank
Video Super-Resolution   UDM10                       PSNR 24.86                 88
Video Super-Resolution   SPMCS                       PSNR 22.25                 61
Video Super-Resolution   MVSR4x                      PSNR 22.49                 49
Video Super-Resolution   VideoLQ                     MUSIQ 56.26                17
Video Super-Resolution   video 33-frame 720x1280     Inference Time (s) 6.82    13
Video Super-Resolution   720p videos 100 frames      Time (s) 20.7              6
Video Super-Resolution   VideoLQ                     MUSIQ Score 56.26          3