
HIPPO: Accelerating Video Large Language Models Inference via Holistic-aware Parallel Speculative Decoding

About

Speculative decoding (SD) has emerged as a promising approach to accelerating LLM inference without sacrificing output quality. SD methods tailored for video-LLMs primarily prune redundant visual tokens to mitigate the computational burden of massive visual inputs, yet they still fall short of the inference acceleration achieved on text-only LLMs. Extensive experiments show that this gap stems mainly from two limitations: (i) their pruning strategies inadequately preserve visually semantic tokens, degrading draft quality and acceptance rates; (ii) even under aggressive pruning (e.g., removing 90% of visual tokens), the draft model's remaining inference cost caps the overall speedup. To address these limitations, we propose HIPPO, a general holistic-aware parallel speculative decoding framework. HIPPO introduces (i) a semantic-aware token preservation method that fuses global attention scores with local visual semantics to retain semantic information at high pruning ratios, and (ii) a video parallel SD algorithm that decouples and overlaps the draft generation and target verification phases. Experiments on four video-LLMs across six benchmarks demonstrate HIPPO's effectiveness, yielding up to a 3.51x speedup over vanilla auto-regressive decoding.
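For readers unfamiliar with the draft-then-verify loop that SD frameworks like HIPPO build on, the following is a minimal greedy-decoding sketch. It is not the paper's algorithm (which pipelines drafting and verification in parallel); the `draft` and `target` callables are hypothetical stand-ins for the small and large models' next-token functions, and the per-position verification loop simulates what a real system does in one batched forward pass.

```python
def speculative_step(prefix, draft, target, k=4):
    """One draft-then-verify round of greedy speculative decoding.

    `draft` and `target` each map a token prefix (list) to the next token.
    Returns the tokens committed this round: the longest accepted draft
    prefix plus one token supplied by the target model.
    """
    # Draft phase: the small model proposes k tokens autoregressively.
    proposal = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposal.append(t)
        ctx.append(t)

    # Verify phase: the target model checks all k positions (a single
    # parallel forward pass in practice) and accepts matching tokens.
    accepted = []
    ctx = list(prefix)
    for t in proposal:
        expected = target(ctx)
        if t != expected:
            accepted.append(expected)  # target's correction ends the round
            return accepted
        accepted.append(t)
        ctx.append(t)

    # All k draft tokens accepted: take a bonus token from the target.
    accepted.append(target(ctx))
    return accepted
```

Because every committed token is either verified or produced by the target model, the output is identical to vanilla greedy decoding with the target alone; the speedup comes from committing several tokens per target pass when the draft's acceptance rate is high, which is why HIPPO's pruning strategy focuses on keeping that rate up.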

Qitan Lv, Tianyu Liu, Wen Wu, Xuenan Xu, Bowen Zhou, Feng Wu, Chao Zhang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Understanding | MVBench | – | – | 247 |
| Video Understanding | VideoMME | – | – | 192 |
| Video Understanding | LongVideoBench | – | – | 79 |
| Video Understanding | VideoMME | Wall-time Speedup | 3.27 | 21 |
| Video Understanding | VDC | MAT | 8.05 | 16 |
| Video Understanding | MLVU | MAT | 12.42 | 16 |
| Video Understanding | LVBench | MAT | 12.12 | 16 |
| Video Understanding | VDC | Wall-time Speedup | 2.31 | 4 |
| Video Understanding | MLVU | Wall-time Speedup | 2.78 | 4 |
