Sparrow: Text-Anchored Window Attention with Visual-Semantic Glimpsing for Speculative Decoding in Video LLMs

About

Although speculative decoding is widely used to accelerate Vision-Language Model (VLM) inference, it suffers severe performance collapse when applied to Video Large Language Models (Vid-LLMs). The draft model typically succumbs to attention dilution and negative visual gain, caused by key-value cache explosion and context window mismatches. We observe a visual semantic internalization phenomenon in Vid-LLMs: critical visual semantics are implicitly encoded into text hidden states during deep-layer interactions, which renders raw visual inputs structurally redundant during deep inference. To address this, we propose the Sparrow framework. It uses visually-aware text-anchored window attention via hidden state reuse to fully offload visual computation to the target model, and it leverages intermediate-layer visual state bridging to train the draft model with semantically rich intermediate states, thereby filtering out low-level visual noise. Additionally, a multi-token prediction strategy is introduced to bridge the training-inference distribution shift. Experiments show that Sparrow achieves an average speedup of 2.82x even with 25k visual tokens, effectively resolving the performance degradation on long sequences and offering a practical solution for real-time long-video tasks.

Libo Zhang, Zhaoning Zhang, Wangyang Hong, Peng Qiao, Dongsheng Li • 2026
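
As a rough illustration of the attention pattern the abstract alludes to, the sketch below builds a boolean mask in which every query position may attend only to the most recent text positions, while visual positions are always masked out. The function name `text_window_mask`, the `window` parameter, and the exact windowing rule are assumptions made for illustration. The paper's actual mechanism (hidden state reuse from the target model, intermediate-layer bridging) is not reproduced here.

```python
from typing import List


def text_window_mask(is_text: List[bool], window: int) -> List[List[bool]]:
    """Return mask[q][k], True when query position q may attend to key k.

    Assumed rule (illustrative only): attention is causal, visual keys are
    never attended to, and at most `window` of the most recent text keys
    are visible to each query.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    n = len(is_text)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        seen = 0
        # Walk backwards from q so that only past and current keys are used.
        for k in range(q, -1, -1):
            if is_text[k]:
                mask[q][k] = True
                seen += 1
                if seen == window:
                    break
    return mask


# Toy sequence: one text token, two visual tokens, then two text tokens.
layout = [True, False, False, True, True]
mask = text_window_mask(layout, window=2)
```

With this toy layout, the last query sees only the two trailing text tokens, so the number of keys it attends to stays bounded by `window` no matter how many visual tokens precede it.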

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Video Understanding | LongVideoBench | -- | 79 |
| Video Understanding | VideoMME | Wall-time Speedup: 2.02 | 21 |
| Image Captioning | COCO Captions | Average Accepted Length (tau): 4.13 | 10 |
| Multimodal Question Answering | ScienceQA (SQA) | Avg Accepted Length: 3.86 | 10 |
| Multimodal Understanding | MME | Avg Accepted Length (tau): 3.78 | 10 |
| Multimodal Understanding | MM-Vet | Average Accepted Length (tau): 3.82 | 10 |
| Speculative Decoding | MVBench | Tau (τ): 3.87 | 8 |
| Speculative Decoding | VideoDetailCaption (~17k visual tokens) | Tau (τ): 3.89 | 8 |
| Speculative Decoding | LongVideoBench (~15k visual tokens) | Tau (τ): 3.78 | 8 |
| Video Detailed Captioning | VideoDetailedCaption | Tau (τ): 3.88 | 8 |
Showing 10 of 12 rows
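
For readers unfamiliar with the accepted length (τ) metric reported above, here is a minimal sketch of how it is commonly computed under greedy verification. The draft model proposes k tokens, the target model checks them in one pass, and the step emits the matching prefix plus one token supplied by the target. The helper names are hypothetical, and whether the paper counts the extra target token in τ is an assumption here.

```python
from typing import Iterable, Sequence, Tuple


def accepted_length(draft: Sequence[int], target: Sequence[int]) -> int:
    """Tokens emitted by one draft-then-verify step under greedy matching.

    `draft` holds the k tokens proposed by the draft model. `target` holds
    the k + 1 tokens the target model produces at the same positions.
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must contain exactly one more token than draft")
    matched = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        matched += 1
    # The target always contributes one token: either the correction at the
    # first mismatch or the bonus token after a fully accepted draft.
    return matched + 1


def average_accepted_length(
    steps: Iterable[Tuple[Sequence[int], Sequence[int]]]
) -> float:
    """Mean number of tokens emitted per verification step (tau)."""
    lengths = [accepted_length(d, t) for d, t in steps]
    if not lengths:
        raise ValueError("at least one step is required")
    return sum(lengths) / len(lengths)


# Three toy steps with k = 3 drafted tokens each.
steps = [
    ([1, 2, 3], [1, 2, 9, 4]),  # first two drafts match
    ([5, 6, 7], [5, 6, 7, 8]),  # all drafts match
    ([1, 2, 3], [0, 2, 3, 4]),  # first draft already wrong
]
tau = average_accepted_length(steps)
```

Under this definition τ ranges from 1 (no drafted token accepted) to k + 1, so a larger τ means fewer target-model passes per generated token. Wall-time speedup also depends on the draft model's own cost, so it is reported separately from τ.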
