DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation

About

Autoregressive long video generation often adopts bounded-memory streaming for efficiency, typically combining local windows for short-term continuity with static early-frame sinks as long-range anchors. However, this fixed allocation keeps early frames cached even when the current visual state has substantially diverged from them, while discarding potentially more relevant intermediate history. As a result, the retained long-range context may become less adaptive and bias generation toward outdated cues; in severe cases, RoPE-induced phase re-alignment can homogenize inter-head attention and cause sink collapse, where content regresses toward sink frames. We propose DySink, a retrieval-based framework that maintains a compact memory bank and selects visually relevant historical frames as dynamic frame sinks. DySink couples adaptive retrieval with a sink anomaly gate, which detects excessive inter-head consensus over retrieved context and suppresses collapse-prone context. Experiments on minute-long videos show that DySink consistently improves dynamic degree over strong baselines while also achieving higher temporal quality. The code and model weights will be released at https://github.com/yebo0216best/DySink.
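The two mechanisms named in the abstract, retrieval of visually relevant history and a gate on inter-head consensus, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the DySink code: the function names, the use of cosine similarity over per-frame feature vectors, the mean pairwise similarity as the consensus measure, the `threshold` value, and the choice to drop all retrieved context when the gate fires.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def retrieve_dynamic_sinks(query, memory_bank, k):
    """Hypothetical retrieval step: indices of the k memory-bank frames
    most similar to the current frame feature, returned in temporal order."""
    ranked = sorted(
        range(len(memory_bank)),
        key=lambda i: cosine(query, memory_bank[i]),
        reverse=True,
    )
    return sorted(ranked[:k])


def inter_head_consensus(head_attn):
    """Assumed consensus measure: mean pairwise cosine similarity between
    the attention distributions that each head places on the retrieved context."""
    n = len(head_attn)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    return sum(cosine(head_attn[i], head_attn[j]) for i, j in pairs) / len(pairs)


def gate_sinks(sink_ids, head_attn, threshold=0.95):
    """Hypothetical anomaly gate: discard the retrieved context when the heads
    agree more strongly than `threshold` (an illustrative value)."""
    if inter_head_consensus(head_attn) > threshold:
        return []
    return sink_ids


# Toy data: four stored frame features and one current-frame feature.
memory_bank = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.7, 0.7]]
query = [1.0, 0.05]
sinks = retrieve_dynamic_sinks(query, memory_bank, k=2)

diverse_heads = [[0.9, 0.1], [0.1, 0.9]]    # heads attend differently
collapsed_heads = [[0.5, 0.5], [0.5, 0.5]]  # heads attend identically

kept = gate_sinks(sinks, diverse_heads)
suppressed = gate_sinks(sinks, collapsed_heads)
```

In this toy setup the retrieved frames survive the gate when the heads disagree and are discarded when every head attends identically. A real system would more plausibly operate on cached key/value tensors and could down-weight context rather than drop it outright.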

Bo Ye, Xinyu Cui, Jian Zhao, Tong Wei, Min-Ling Zhang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Generation | VBench 5s | Quality Score | 84.68 | 73 |
| Long-horizon Video Generation | VBench Long (75s) | Text Alignment | 28.4 | 8 |
| Long-horizon Video Generation | VBench Long (100s) | Text Alignment | 28.36 | 8 |
| Video Generation | VBench Long 50s | Text Alignment | 28.39 | 8 |
