Sparrow: Text-Anchored Window Attention with Visual-Semantic Glimpsing for Speculative Decoding in Video LLMs

About

Although speculative decoding is widely used to accelerate inference in Vision-Language Models (VLMs), its performance collapses when it is applied to Video Large Language Models (Vid-LLMs). The draft model typically suffers from attention dilution and negative visual gain due to key-value cache explosion and context window mismatches. We observe a visual semantic internalization phenomenon in Vid-LLMs: critical visual semantics are implicitly encoded into text hidden states during deep-layer interactions, which renders raw visual inputs structurally redundant during deep inference. To address this, we propose the Sparrow framework. It first uses visually aware text-anchored window attention, implemented via hidden state reuse, to fully offload visual computation to the target model. It then applies intermediate-layer visual state bridging, training the draft model on semantically rich intermediate states and thereby filtering out low-level visual noise. Additionally, a multi-token prediction strategy is introduced to bridge the training-inference distribution shift. Experiments show that Sparrow achieves an average speedup of 2.82x even with 25k visual tokens, effectively resolving the performance degradation on long sequences and offering a practical solution for real-time long video tasks.
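For readers unfamiliar with the draft-then-verify loop that the abstract builds on, the following is a minimal sketch of generic greedy speculative decoding. It is not the Sparrow implementation. The names `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical, and the toy "models" are simple functions over integer tokens. The sketch also assumes one common convention for the average accepted length (tau): tokens committed per verification step, counting the token supplied by the target. The paper's exact definition is not stated here.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
# Hypothetical interface: a "model" maps a token context to its greedy next token.
NextToken = Callable[[Sequence[Token]], Token]


def speculative_step(prefix: List[Token], draft: NextToken,
                     target: NextToken, k: int) -> List[Token]:
    """Run one draft-then-verify step and return the tokens it commits."""
    # 1) The draft model proposes k tokens autoregressively.
    proposed: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2) The target checks each proposal in order. A real system scores all
    #    k positions in a single forward pass; this loop is for clarity only.
    committed: List[Token] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            committed.append(expected)  # first mismatch: keep the target's token
            return committed
        committed.append(tok)
        ctx.append(tok)

    # 3) Every proposal was accepted, so the target adds one bonus token.
    committed.append(target(ctx))
    return committed


def generate(prompt: List[Token], draft: NextToken, target: NextToken,
             k: int, max_new_tokens: int) -> Tuple[List[Token], float]:
    """Decode max_new_tokens tokens; return them with the mean tokens per step."""
    out = list(prompt)
    step_lengths: List[int] = []
    while len(out) - len(prompt) < max_new_tokens:
        committed = speculative_step(out, draft, target, k)
        out.extend(committed)
        step_lengths.append(len(committed))
    new_tokens = out[len(prompt):len(prompt) + max_new_tokens]
    tau = sum(step_lengths) / len(step_lengths)
    return new_tokens, tau


def toy_target(ctx: Sequence[Token]) -> Token:
    """Toy target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: Sequence[Token]) -> Token:
    """Toy draft model: agrees with the target except at every fourth position."""
    return (ctx[-1] + 1) % 10 if len(ctx) % 4 else 0


tokens, tau = generate([3], toy_draft, toy_target, k=4, max_new_tokens=20)
```

Under greedy verification the committed tokens are always the ones the target alone would have produced, so a weaker draft lowers tau and the speedup but does not change the output. This is why the abstract frames draft quality on long visual contexts as an acceleration problem.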

Libo Zhang, Zhaoning Zhang, Wangyang Hong, Peng Qiao, Dongsheng Li • 2026

Related benchmarks

Task	Dataset	Result
Video Understanding	LongVideoBench	--	123
Video Understanding	VideoMME	Wall-time Speedup: 2.02	21
Image Captioning	COCO Captions	Average Accepted Length (tau): 4.13	10
Multimodal Question Answering	ScienceQA (SQA)	Avg Accepted Length: 3.86	10
Multimodal Understanding	MME	Avg Accepted Length (tau): 3.78	10
Multimodal Understanding	MM-Vet	Average Accepted Length (tau): 3.82	10
Speculative Decoding	MVBench	Tau (τ): 3.87	8
Speculative Decoding	VideoDetailCaption (~17k visual tokens)	Tau (τ): 3.89	8
Speculative Decoding	LongVideoBench (~15k visual tokens)	Tau (τ): 3.78	8
Video Detailed Captioning	VideoDetailedCaption	Tau (τ): 3.88	8

Showing 10 of 12 rows
