DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation

About

Autoregressive long video generation often adopts bounded-memory streaming for efficiency, typically combining local windows for short-term continuity with static early-frame sinks as long-range anchors. However, this fixed allocation keeps early frames cached even when the current visual state has substantially diverged from them, while discarding potentially more relevant intermediate history. As a result, the retained long-range context may become less adaptive and bias generation toward outdated cues; in severe cases, RoPE-induced phase re-alignment can homogenize inter-head attention and cause sink collapse, where content regresses toward sink frames. We propose DySink, a retrieval-based framework that maintains a compact memory bank and selects visually relevant historical frames as dynamic frame sinks. DySink couples adaptive retrieval with a sink anomaly gate, which detects excessive inter-head consensus over retrieved context and suppresses collapse-prone context. Experiments on minute-long videos show that DySink consistently improves temporal quality over strong baselines while also achieving higher dynamic degree, enabling coherent and more natural long-horizon visual evolution. The code and model weights are released at https://github.com/yebo0216best/DySink.
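The abstract names two mechanisms: retrieving visually relevant history as dynamic frame sinks, and gating that context when attention heads agree too strongly. The Python sketch below is one minimal reading of that description, not the authors' implementation. Everything in it is an assumption made for illustration: the `MemoryBank` class, FIFO eviction, cosine similarity as the relevance score, mean pairwise cosine similarity between per-head attention distributions as the consensus measure, and the `0.95` threshold. The released repository should be consulted for the actual method.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class MemoryBank:
    """Hypothetical compact memory bank of (frame_index, feature) pairs.

    Assumption: the bank is bounded and evicts the oldest entry when full.
    """

    def __init__(self, capacity):
        self.frames = deque(maxlen=capacity)

    def add(self, index, feature):
        self.frames.append((index, feature))

    def retrieve(self, query, k):
        """Return indices of the k stored frames most similar to the query feature."""
        ranked = sorted(self.frames, key=lambda f: cosine(query, f[1]), reverse=True)
        return [index for index, _ in ranked[:k]]


def head_consensus(head_attn):
    """Mean pairwise cosine similarity between per-head attention distributions.

    head_attn holds one attention distribution per head over the retrieved frames.
    Values near 1.0 mean the heads attend almost identically.
    """
    n = len(head_attn)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    return sum(cosine(head_attn[i], head_attn[j]) for i, j in pairs) / len(pairs)


def gate_sinks(sink_ids, head_attn, threshold=0.95):
    """Drop the retrieved sinks when head consensus exceeds the (assumed) threshold."""
    return [] if head_consensus(head_attn) > threshold else sink_ids


# Toy usage with 2-D features standing in for frame embeddings.
bank = MemoryBank(capacity=4)
bank.add(0, [1.0, 0.0])
bank.add(1, [0.0, 1.0])
bank.add(2, [0.9, 0.1])

sinks = bank.retrieve([1.0, 0.1], k=2)

diverse_heads = [[0.9, 0.1], [0.1, 0.9]]   # heads disagree: keep the sinks
uniform_heads = [[0.5, 0.5], [0.5, 0.5]]   # heads agree: suppress the sinks
kept = gate_sinks(sinks, diverse_heads)
suppressed = gate_sinks(sinks, uniform_heads)
```

In this toy run the query is closest to frames 2 and 0, so those become the dynamic sinks. The gate keeps them when the two heads attend differently and returns an empty list when the heads are identical, which stands in for the "excessive inter-head consensus" case the abstract describes.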

Bo Ye, Xinyu Cui, Jian Zhao, Tong Wei, Min-Ling Zhang • 2026

Related benchmarks

Task	Dataset	Metric	Result
Video Generation	VBench 5s	Quality Score	84.68	97
Long-horizon Video Generation	VBench Long (75s)	Text Alignment	28.4	8
Long-horizon Video Generation	VBench Long (100s)	Text Alignment	28.36	8
Video Generation	VBench Long 50s	Text Alignment	28.39	8

