
RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers

About

Recent advancements in video generation have enabled models to synthesize high-quality, minute-long videos. However, generating even longer videos with temporal coherence remains a major challenge, and existing length extrapolation methods lead to temporal repetition or motion deceleration. In this work, we systematically analyze the role of frequency components in positional embeddings and identify an intrinsic frequency that primarily governs extrapolation behavior. Based on this insight, we propose RIFLEx, a minimal yet effective approach that reduces the intrinsic frequency to suppress repetition while preserving motion consistency, without requiring any additional modifications. RIFLEx offers a true free lunch: it achieves high-quality 2x extrapolation on state-of-the-art video diffusion transformers in a completely training-free manner. Moreover, with minimal fine-tuning and without long videos, it enhances quality and enables 3x extrapolation. Project page and code: https://riflex-video.github.io/.
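
The core idea of lowering one frequency component can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the temporal positional embedding is rotary (RoPE) with frequencies base^(-2j/dim), that the intrinsic component is the one whose period lies closest to the training length, and that the component is lowered until the extrapolated length fits within a fraction of one period. The function names, the selection rule, and the `margin` value are all hypothetical choices made for illustration.

```python
import math


def rope_frequencies(dim, base=10000.0):
    """Standard RoPE angular frequencies theta_j = base^(-2j/dim), j = 0..dim/2-1."""
    return [base ** (-2.0 * j / dim) for j in range(dim // 2)]


def pick_intrinsic_index(freqs, train_len):
    """Assumed rule: pick the component whose period 2*pi/theta is closest to train_len."""
    return min(
        range(len(freqs)),
        key=lambda j: abs(2.0 * math.pi / freqs[j] - train_len),
    )


def lower_intrinsic_frequency(freqs, train_len, target_len, margin=0.9):
    """Return (adjusted_freqs, k), where component k has been reduced so that
    target_len positions span at most `margin` of a single period.
    All other components are left unchanged."""
    k = pick_intrinsic_index(freqs, train_len)
    adjusted = list(freqs)
    # Never raise the frequency: keep the original if it is already low enough.
    adjusted[k] = min(adjusted[k], margin * 2.0 * math.pi / target_len)
    return adjusted, k


if __name__ == "__main__":
    freqs = rope_frequencies(dim=16)
    # Hypothetical lengths: trained on 32 temporal positions, extrapolating to 64 (2x).
    adjusted, k = lower_intrinsic_frequency(freqs, train_len=32, target_len=64)
    print("intrinsic index:", k)
    print("original frequency:", freqs[k])
    print("adjusted frequency:", adjusted[k])
```

Because only one entry of the frequency list changes, a modification of this kind needs no retraining, which is consistent with the training-free 2x setting described above.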

Min Zhao, Guande He, Yixiao Chen, Hongzhou Zhu, Chongxuan Li, Jun Zhu • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Generation | VBench HunyuanVideo | Consistency | 99.06 | 15 |
| Long Video Generation | VBench 4x extended video lengths 1.0 | Subject Consistency | 98.69 | 12 |
| Long Video Generation | VBench 2x extension 1.0 | Subject Consistency | 97.32 | 12 |
| Video Extrapolation | VBench CogVideoX with 2x extrapolation v1.0 (test) | NoRepeat Score | 99.42 | 7 |
| Video Extrapolation | VBench CogVideoX with 4x extrapolation v1.0 (test) | NoRepeat Score | 97 | 7 |
| Video Extrapolation | VBench Wan with 2x extrapolation v1.0 (test) | Dynamic Score | 32 | 7 |
| Video Extrapolation | VBench CogVideoX with 3x extrapolation v1.0 (test) | NoRepeat Score | 97.86 | 7 |
| Video Extrapolation | VBench HunyuanVideo with 2x extrapolation v1.0 (test) | NoRepeat Score | 97.27 | 7 |
| Video Extrapolation | VBench HunyuanVideo with 5x extrapolation v1.0 (test) | NoRepeat Score | 53.65 | 7 |
| Text-to-Video Generation | User Study long videos Wan2.1-T2V-1.3B base model | Content Consistency | 3.31 | 6 |
