Sparrow: Text-Anchored Window Attention with Visual-Semantic Glimpsing for Speculative Decoding in Video LLMs
About
Although speculative decoding is widely used to accelerate inference in Vision-Language Models (VLMs), its performance collapses when applied to Video Large Language Models (Vid-LLMs). The draft model typically suffers from attention dilution and negative visual gain, caused by key-value cache explosion and context window mismatches. We observe a visual semantic internalization phenomenon in Vid-LLMs: critical visual semantics are implicitly encoded into text hidden states during deep-layer interactions, which renders raw visual inputs structurally redundant during deep inference. To address this, we propose the Sparrow framework. It uses visually-aware text-anchored window attention via hidden state reuse to fully offload visual computation to the target model. It also leverages intermediate-layer visual state bridging to train the draft model on semantically rich intermediate states, thereby filtering out low-level visual noise. Additionally, a multi-token prediction strategy is introduced to bridge the training-inference distribution shift. Experiments show that Sparrow achieves an average speedup of 2.82x even with 25k visual tokens, effectively resolving the performance degradation on long sequences and offering a practical solution for real-time long-video tasks.
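To make the idea of text-anchored window attention more concrete, the sketch below builds a toy attention mask for a draft model. It is only one plausible reading of the abstract, not the paper's actual design: the function name `text_anchored_window_mask`, the `window` parameter, and the rule "attend to every earlier text token plus a short local window" are all assumptions introduced for illustration.

```python
def text_anchored_window_mask(is_visual, window):
    """Build a causal boolean attention mask as a list of rows.

    is_visual[i] is True when position i holds a visual token.
    Query i may attend to key j (with j <= i) when j is a text token
    (a "text anchor") or when j lies within the last `window` positions.
    This rule is an illustrative assumption, not the paper's exact design.
    """
    n = len(is_visual)
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            causal = j <= i              # never look at future positions
            in_window = i - j < window   # recent positions, any modality
            is_anchor = not is_visual[j] # text tokens stay visible
            row.append(causal and (in_window or is_anchor))
        mask.append(row)
    return mask


# Toy layout: 2 text tokens, 4 visual tokens, then 2 text tokens.
layout = [False, False, True, True, True, True, False, False]
mask = text_anchored_window_mask(layout, window=2)

# Number of keys the last query can see, versus 8 under full causal attention.
kept = sum(mask[-1])
```

In this toy layout the last query skips all four visual positions, which illustrates how such a mask could keep the draft model's key-value cache from growing with the number of visual tokens.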
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Understanding | LongVideoBench | -- | -- | 79 |
| Video Understanding | VideoMME | Wall-time Speedup | 2.02 | 21 |
| Image Captioning | COCO Captions | Average Accepted Length (tau) | 4.13 | 10 |
| Multimodal Question Answering | ScienceQA (SQA) | Avg Accepted Length | 3.86 | 10 |
| Multimodal Understanding | MME | Avg Accepted Length (tau) | 3.78 | 10 |
| Multimodal Understanding | MM-Vet | Average Accepted Length (tau) | 3.82 | 10 |
| Speculative Decoding | MVBench | Tau (τ) | 3.87 | 8 |
| Speculative Decoding | VideoDetailCaption (~17k visual tokens) | Tau (τ) | 3.89 | 8 |
| Speculative Decoding | LongVideoBench (~15k visual tokens) | Tau (τ) | 3.78 | 8 |
| Video Detailed Captioning | VideoDetailedCaption | Tau (τ) | 3.88 | 8 |
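
Most results above are reported as average accepted length (tau). As a rough illustration of how such a number can be computed, the sketch below assumes greedy verification and a convention in which each draft-verify round emits the matched draft prefix plus one token from the target model. The source does not state its exact convention, so the helper names and the toy token IDs are hypothetical.

```python
def accepted_length(draft_tokens, target_tokens):
    """Tokens emitted in one draft-verify round under greedy verification.

    target_tokens[i] is the target model's greedy choice at the position of
    draft_tokens[i]; it carries one extra entry for the position after the
    last draft token.
    """
    matched = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        matched += 1
    # Matched prefix plus the target's own token (a correction or a bonus).
    return matched + 1


def average_accepted_length(rounds):
    """Mean number of tokens emitted per draft-verify round."""
    lengths = [accepted_length(d, t) for d, t in rounds]
    return sum(lengths) / len(lengths)


# Toy data: (draft tokens, target tokens) for three rounds.
rounds = [
    ([5, 9, 2, 7], [5, 9, 2, 7, 1]),  # all four drafts match
    ([3, 3, 8, 0], [3, 4, 8, 0, 6]),  # mismatch at the second draft
    ([1, 2, 3, 4], [9, 2, 3, 4, 5]),  # mismatch at the first draft
]
tau = average_accepted_length(rounds)
```

Under this convention a higher tau means fewer target-model forward passes per generated token. Wall-time speedup also depends on the draft model's own cost, so the two metrics are related but not interchangeable.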