DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation
About
Autoregressive long video generation often adopts bounded-memory streaming for efficiency, typically combining local windows for short-term continuity with static early-frame sinks as long-range anchors. However, this fixed allocation keeps early frames cached even when the current visual state has substantially diverged from them, while discarding potentially more relevant intermediate history. As a result, the retained long-range context may become less adaptive and bias generation toward outdated cues; in severe cases, RoPE-induced phase re-alignment can homogenize inter-head attention and cause sink collapse, where content regresses toward sink frames. We propose DySink, a retrieval-based framework that maintains a compact memory bank and selects visually relevant historical frames as dynamic frame sinks. DySink couples adaptive retrieval with a sink anomaly gate, which detects excessive inter-head consensus over retrieved context and suppresses collapse-prone context. Experiments on minute-long videos show that DySink consistently improves dynamic degree over strong baselines while also achieving higher temporal quality. The code and model weights will be released at https://github.com/yebo0216best/DySink.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Video Generation | VBench 5s | Quality Score84.68 | 73 | |
| Long-horizon Video Generation | VBench Long (75s) | Text Alignment28.4 | 8 | |
| Long-horizon Video Generation | VBench Long (100s) | Text Alignment28.36 | 8 | |
| Video Generation | VBench Long 50s | Text Alignment28.39 | 8 |