
Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression

About

Recent advances in autoregressive video diffusion have enabled real-time frame streaming, yet existing solutions still suffer from temporal repetition, drift, and motion deceleration. We find that naively applying StreamingLLM-style attention sinks to video diffusion leads to fidelity degradation and motion stagnation. To overcome this, we introduce Deep Forcing, which consists of two training-free mechanisms that require no fine-tuning. Specifically, 1) Deep Sink dedicates half of the sliding window to persistent sink tokens and re-aligns their temporal RoPE phase to the current timeline, stabilizing global context during long rollouts. 2) Participative Compression performs importance-aware KV cache pruning that preserves only tokens actively participating in recent attention while safely discarding redundant and degraded history, minimizing error accumulation when generating at out-of-distribution lengths. Together, these components enable over 12x extrapolation (e.g., 5s-trained to 60s+ generation) with better imaging quality than LongLive, better aesthetic quality than RollingForcing, nearly preserved overall consistency, and substantial gains in dynamic degree, all while maintaining real-time generation. Our results demonstrate that training-free KV cache management can match or exceed training-based approaches for autoregressive streaming long-video generation.
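The importance-aware pruning described above can be illustrated with a small sketch. This is not the authors' implementation: the function names (`attention_mass`, `prune_cache`), the parameters (`num_sink`, `num_recent`, `budget`), and the scoring rule (summed softmax attention from recent queries) are all assumptions made for illustration. The temporal RoPE re-alignment of Deep Sink is not shown.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_mass(queries, keys):
    """For each cached key, sum the softmax attention it receives
    from the given (recent) queries. Hypothetical importance score."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    mass = [0.0] * len(keys)
    for q in queries:
        logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        for i, w in enumerate(softmax(logits)):
            mass[i] += w
    return mass


def prune_cache(keys, recent_queries, num_sink, num_recent, budget):
    """Return the sorted indices of cache entries to keep.

    Assumed policy: always keep the first `num_sink` entries (sinks) and
    the last `num_recent` entries, then fill the remaining budget with
    the middle entries that receive the most recent attention.
    """
    n = len(keys)
    if n <= budget:
        return list(range(n))
    sink = list(range(min(num_sink, n)))
    recent = list(range(max(n - num_recent, len(sink)), n))
    protected = set(sink) | set(recent)
    middle = [i for i in range(n) if i not in protected]
    slots = max(budget - len(protected), 0)
    mass = attention_mass(recent_queries, keys)
    top = sorted(middle, key=lambda i: mass[i], reverse=True)[:slots]
    return sorted(sink + recent + top)


# Toy cache of 8 two-dimensional keys; entries 3 and 5 align best with the query.
keys = [[1, 0], [1, 0], [0, 0], [5, 0], [0, 0], [3, 0], [1, 0], [1, 0]]
kept = prune_cache(keys, recent_queries=[[1.0, 0.0]],
                   num_sink=2, num_recent=2, budget=6)
print(kept)  # [0, 1, 3, 5, 6, 7]
```

In this toy setup the two sink entries and the two most recent entries are always retained, and the two remaining slots go to the middle entries with the highest attention mass. A real system would score per layer and per head on tensors rather than Python lists.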

Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, Seungryong Kim • 2025

Related benchmarks

Task                    Dataset                              Metric               Result  Rank
----------------------  -----------------------------------  -------------------  ------  ----
Short Video Generation  VBench-Long 60 seconds               Aesthetic Quality    59.63   13
Long Video Generation   VBench-Long 60 seconds               Subject Consistency  96.96   12
Video Generation        VBench standard prompt (5s setting)  Dynamic Score        63.89   11
Short Video Generation  VBench-Long 30 seconds               Aesthetic Quality    59.87   10
Long Video Generation   VBench Long 120 seconds              Aesthetic Quality    59.16   8
Long Video Generation   VBench-Long (240 seconds)            Aesthetic Quality    57.75   8
Long Video Generation   VBench 120s generation               Dynamic Degree       52.84   6
Video Generation        nuScenes 16-second videos (test)     Overall FID          50.3    6
Video Generation        ODV-YT 32-second videos              FID (Overall)        42.3    6
Long Video Generation   VBench-Long 30 seconds               Subject Consistency  97.34   6
Showing 10 of 18 rows
