
ASAP: Attention-Shift-Aware Pruning for Efficient LVLM Inference

About

While Large Vision-Language Models (LVLMs) demonstrate exceptional multi-modal capabilities, the quadratic computational cost of processing high-resolution visual tokens remains a critical bottleneck. Recent token reduction strategies attempt to accelerate inference, but they exploit attention values inadequately and fail to address token redundancy. More critically, they overlook the "attention shift" phenomenon inherent in LVLMs, which skews token attention scores. In this work, we propose ASAP, a novel training-free, KV-Cache-compatible pruning recipe that addresses these limitations. First, we mitigate attention shift with a dynamic bidirectional soft attention mask, ensuring that genuinely informative tokens are selected rather than relying on naive attention-based selection. Second, we posit that high semantic redundancy within the token set degrades performance, and therefore introduce a weighted soft merging component that merges semantically similar tokens, preserving only the most feature-dense visual patches for subsequent layers. ASAP achieves virtually lossless compression of visual context, retaining 99.02% of the original LLaVA-NeXT-7B performance while cutting FLOPs by ~80%.
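The two-stage recipe described above (attention-based selection followed by redundancy-aware merging) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name `asap_prune`, the plain top-k selection (standing in for the paper's bidirectional soft attention mask), and the cosine-similarity grouping with threshold `merge_threshold` are all assumptions made for the sketch.

```python
import torch

def asap_prune(visual_tokens, attn_scores, keep_ratio=0.2, merge_threshold=0.9):
    """Hypothetical sketch of attention-based pruning plus weighted soft merging.

    visual_tokens: (N, d) visual token features
    attn_scores:   (N,) per-token attention scores (assumed already corrected
                   for attention shift; the paper's soft mask is not modeled here)
    """
    N, d = visual_tokens.shape
    k = max(1, int(N * keep_ratio))
    # 1) Keep the k highest-scoring tokens (plain top-k stands in for the
    #    paper's dynamic bidirectional soft attention mask).
    keep_idx = torch.topk(attn_scores, k).indices
    kept = visual_tokens[keep_idx]
    w = attn_scores[keep_idx]
    # 2) Weighted soft merging: group kept tokens whose cosine similarity
    #    exceeds merge_threshold and average each group, weighted by score.
    normed = torch.nn.functional.normalize(kept, dim=-1)
    sim = normed @ normed.T
    merged, used = [], torch.zeros(k, dtype=torch.bool)
    for i in range(k):
        if used[i]:
            continue
        group = (sim[i] > merge_threshold) & ~used  # includes token i itself
        used |= group
        gw = w[group].unsqueeze(-1)
        merged.append((kept[group] * gw).sum(0) / gw.sum())
    return torch.stack(merged)
```

With `keep_ratio=1.0` and two pairs of duplicate tokens, the four inputs collapse to two merged tokens, illustrating how redundant visual patches are fused rather than all retained.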

Surendra Pathak, Bo Han • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | – | – | 1455 |
| Visual Question Answering | VQA v2 | Accuracy | 79.46 | 1362 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 60.69 | 807 |
| Multimodal Evaluation | MME | Score | 1550 | 658 |
| Science Question Answering | ScienceQA (SQA) | Accuracy | 72.89 | 273 |
| Multimodal Benchmarking | MMBench CN | Score | 82.47 | 129 |
| Visual Reasoning | GQA | Accuracy | 62.43 | 93 |
| Multimodal Benchmark | MMBench (MMB) | Accuracy | 68.29 | 81 |
| Visual Question Answering | GQA | GQA Score | 62.31 | 37 |
| Multimodal Understanding | VQAv2, GQA, VQAText, MMB, MMVet | VQAv2 Accuracy | 81.19 | 7 |
