ApET: Approximation-Error Guided Token Compression for Efficient VLMs

About

Recent Vision-Language Models (VLMs) have demonstrated remarkable multimodal understanding capabilities, yet the redundant visual tokens incur prohibitive computational overhead and degrade inference efficiency. Prior studies typically relies on [CLS] attention or text-vision cross-attention to identify and discard redundant visual tokens. Despite promising results, such solutions are prone to introduce positional bias and, more critically, are incompatible with efficient attention kernels such as FlashAttention, limiting their practical deployment for VLM acceleration. In this paper, we step away from attention dependencies and revisit visual token compression from an information-theoretic perspective, aiming to maximally preserve visual information without any attention involvement. We present ApET, an Approximation-Error guided Token compression framework. ApET first reconstructs the original visual tokens with a small set of basis tokens via linear approximation, then leverages the approximation error to identify and drop the least informative tokens. Extensive experiments across multiple VLMs and benchmarks demonstrate that ApET retains 95.2% of the original performance on image-understanding tasks and even attains 100.4% on video-understanding tasks, while compressing the token budgets by 88.9% and 87.5%, respectively. Thanks to its attention-free design, ApET seamlessly integrates with FlashAttention, enabling further inference acceleration and making VLM deployment more practical. Code is available at https://github.com/MaQianKun0/ApET.

Qiankun Ma, Ziyao Zhang, Haofei Wang, Jie Chen, Zhen Song, Hairong Zheng• 2026

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy87.2	2019
Visual Question Answering	VizWiz	Accuracy51.9	1820
Visual Question Answering	VQA v2	Accuracy76.2	1429
Text-based Visual Question Answering	TextVQA	Accuracy54.4	962
Multimodal Understanding	MMBench	Accuracy63.4	847
Science Question Answering	ScienceQA	Accuracy68.9	791
Multimodal Evaluation	MME	Score2.18e+3	727
Visual Question Answering	GQA	Accuracy63	524
Multimodal Reasoning	MM-Vet	MM-Vet Score29.4	517
Multimodal Understanding	SEED-Bench	Accuracy56.8	516

Showing 10 of 45 rows

Other info

Follow for update

@wizwand_team Discord