ASAP: Attention Sink Anchored Pruning

About

Vision Transformers (ViTs) face severe computational bottlenecks due to the quadratic complexity of self-attention at high resolutions. Existing token reduction methods rely on local metrics, such as single-layer attention scores, that are inherently vulnerable to the attention sink phenomenon, where uninformative tokens are paradoxically preserved over salient foreground objects. We propose ASAP (Attention Sink Anchored Pruning), a training-free framework that recasts this sink as a feature. Modeling ViT information flow as a Lazy Random Walk, ASAP identifies the sink as a dominant accumulator of probability mass. By computing the diffusion distance to the sink within the cumulative transition matrix, ASAP partitions tokens via Radial Diffusion Clustering and compresses background redundancy through Transition Weight Pooling in a single shot. Extensive experiments across image, video, and vision-language tasks demonstrate that ASAP outperforms state-of-the-art methods, accelerating throughput by up to 48% while maintaining, or even exceeding, baseline accuracy.
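
The pipeline described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the formulas, so the laziness factor `alpha`, the layer-composition order, the choice of column mass to pick the sink, the Euclidean row distance used as "diffusion distance", the equal-size rings, and the pooling weights are all assumptions made here for illustration, and the function names are hypothetical.

```python
import numpy as np

def cumulative_transition(attn_layers, alpha=0.5):
    """Compose per-layer lazy random-walk steps into one row-stochastic matrix.

    Assumption: each layer's step is alpha * I + (1 - alpha) * A, where A is a
    row-stochastic (head-averaged) attention matrix of shape (n, n).
    """
    n = attn_layers[0].shape[-1]
    T = np.eye(n)
    for attn in attn_layers:
        step = alpha * np.eye(n) + (1.0 - alpha) * attn
        T = step @ T  # later layers act on the output of earlier ones
    return T

def prune_sketch(tokens, attn_layers, num_rings=4, alpha=0.5):
    """Hypothetical sink-anchored reduction of (n, d) tokens to one token per ring."""
    T = cumulative_transition(attn_layers, alpha)

    # Assumption: the sink is the token that accumulates the most incoming mass.
    mass = T.sum(axis=0)
    sink = int(mass.argmax())

    # Assumption: distance to the sink is the L2 gap between transition rows.
    dist = np.linalg.norm(T - T[sink], axis=1)

    # Assumption: "radial" clusters are equal-size rings of increasing distance.
    order = np.argsort(dist)
    rings = np.array_split(order, num_rings)

    pooled = []
    for idx in rings:
        if idx.size == 0:
            continue  # more rings requested than tokens available
        # Assumption: pooling weights are the tokens' normalized incoming mass.
        w = mass[idx] / mass[idx].sum()
        pooled.append(w @ tokens[idx])
    return np.stack(pooled), sink, dist
```

Calling `prune_sketch(tokens, attn_layers, num_rings=4)` with `tokens` of shape `(n, d)` and a list of `(n, n)` attention matrices returns the pooled tokens, the index of the detected sink, and the per-token distances. Pooling every ring is a simplification chosen here; the abstract only states that background redundancy is compressed.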

Jaehyuk Lee, Hanyoung Kim, Yanggee Kim, Donghun Lee • 2026

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	--	2056
Visual Question Answering	VQA v2	Accuracy 75.9	1429
Image Classification	ImageNet-1K	Accuracy (Base) 83.35	23
Multimodal Understanding	MMBench MMB EN	Score 65.8	22
Visual Question Answering	GQA	GQA Score 62.3	20
Multimodal Visual Question Answering	LLaVA Evaluation Suite (GQA, MME, POPE, SQA-Img, VizWiz, VQAv2, MMB-En) 1.5	GQA 60.4	16
Multimodal Understanding	MME	MME Score 1.85e+3	10
Object Hallucination Evaluation	POPE	Inference (ms) 64	9
