
Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs

About

Video Large Language Models (Video LLMs) incur high inference latency because of the large number of visual tokens fed to the LLM. To address this, training-free visual token pruning has emerged as a way to reduce computational cost; however, existing methods are primarily validated on Multiple-Choice Question Answering (MCQA) benchmarks, where coarse-grained cues often suffice. In this work, we reveal that these methods suffer a sharp performance collapse on fine-grained understanding tasks requiring precise visual grounding, such as hallucination evaluation. To explore this gap, we conduct a systematic analysis and identify sink tokens (semantically uninformative tokens that attract excessive attention) as a key obstacle to fine-grained video understanding. When these sink tokens survive pruning, they distort the model's visual evidence and hinder fine-grained understanding. Motivated by these insights, we propose Sink-Token-aware Pruning (SToP), a simple yet effective plug-and-play method. SToP introduces a sink score that quantifies each token's tendency to behave as a sink, and applies this score within existing spatial and temporal pruning methods to suppress sink tokens, thereby enhancing video understanding. To validate SToP, we apply it to state-of-the-art pruning methods (VisionZip, FastVid, and Holitom) and evaluate it across diverse benchmarks covering hallucination, open-ended generation, compositional reasoning, and MCQA. Our results demonstrate that SToP significantly boosts performance, even when pruning up to 90% of visual tokens.
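
The abstract does not give the formula for the sink score or say how it is combined with a base pruner, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes a hypothetical sink score (the z-score of the mean attention each token receives) and a hypothetical combination rule (subtracting a penalty from the base pruner's importance score before top-k selection). The names `sink_scores`, `sink_aware_keep`, `keep_ratio`, and `penalty` are all invented for this example.

```python
from statistics import mean, pstdev


def sink_scores(attn):
    """Hypothetical sink score (an assumption, not the paper's definition).

    attn[q][k] is the attention weight from query token q to key token k.
    Returns the z-score of the mean attention each key token receives,
    so tokens that attract unusually high attention get a large score.
    """
    n = len(attn[0])
    received = [mean(row[k] for row in attn) for k in range(n)]
    mu, sigma = mean(received), pstdev(received)
    if sigma == 0:
        return [0.0] * n
    return [(r - mu) / sigma for r in received]


def sink_aware_keep(importance, sink, keep_ratio=0.1, penalty=1.0):
    """Keep the top `keep_ratio` fraction of tokens after a sink penalty.

    `importance` stands in for whatever score a base pruner assigns.
    Only positive sink scores are penalized (an illustrative choice).
    Returns the kept token indices in ascending order.
    """
    n = len(importance)
    k = max(1, round(n * keep_ratio))
    adjusted = [imp - penalty * max(s, 0.0) for imp, s in zip(importance, sink)]
    ranked = sorted(range(n), key=lambda i: adjusted[i], reverse=True)
    return sorted(ranked[:k])


# Toy data: token 0 receives most of the attention from every query.
attn = [[0.6, 0.1, 0.1, 0.1, 0.1] for _ in range(4)]
importance = [0.9, 0.5, 0.4, 0.3, 0.2]  # a base pruner would rank token 0 first
sink = sink_scores(attn)

print(sink_aware_keep(importance, sink, keep_ratio=0.4, penalty=0.0))  # [0, 1]
print(sink_aware_keep(importance, sink, keep_ratio=0.4))               # [1, 2]
```

With the penalty disabled, the toy base ranking keeps the attention-heavy token 0; with the penalty enabled, token 0 is dropped in favor of the next-ranked tokens. Setting `keep_ratio=0.1` would correspond to the 90% pruning rate mentioned in the abstract.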

Kibum Kim, Jiwan Kim, Kyle Min, Yueqi Wang, Jinyoung Moon, Julian McAuley, Chanyoung Park • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Fine-grained Video Understanding | EventHallusion | Binary Accuracy | 64.06 | 22 |
| Fine-grained Video Understanding | VideoComp | Action Accuracy | 71.04 | 22 |
| Fine-grained Video Understanding | VCG Bench | Consistency Score | 3.23 | 22 |
| Video Question Answering | VideoMME | Accuracy (Short Length) | 67.89 | 19 |
| Multiple-choice Question Answering | NextQA | Accuracy | 80.38 | 15 |
| Multiple-choice Question Answering | LongVideoBench | Accuracy | 55.12 | 15 |
| Multiple-choice Question Answering | MVBench | Accuracy | 57.84 | 15 |
| Multiple-choice Question Answering | MLVU | Accuracy | 63.75 | 15 |