ETC: Extreme Token Compression via Task-aware Visual Information Distillation in VLMs

About

In Vision-Language Models (VLMs), high-resolution images produce a large number of visual tokens, resulting in high computational cost and KV-cache overhead during inference. To address this problem, we propose Extreme Token Compression (ETC), a framework based on the principle of variational information distillation that minimizes task loss while reducing the number of input tokens. Specifically, from an information-theoretic perspective, we show that minimizing task loss requires the compact representation to preserve an instruction-aware sufficient statistic of the task-relevant visual information for prediction. In practice, ETC uses text-to-image cross-attention to weight the original visual features, approximating this latent instruction-aware predictive statistic. ETC further introduces a variational information distillation objective that enables the compact representation to retain the information needed to recover the predictive statistic. Experiments on LLaVA-1.5-7B and Qwen3-VL-2B show that ETC remains effective even under single-token compression, substantially reducing KV-cache overhead while retaining strong task performance.
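As a rough illustration of the cross-attention weighting step described above, the sketch below pools a set of visual token features into a single vector using attention weights derived from text features. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `softmax` and `attention_pool` are hypothetical, there are no learned query/key projections, attention is averaged uniformly over text tokens, and the variational distillation objective is not modeled at all. The count of 576 visual tokens matches LLaVA-1.5 at 336-pixel input; the feature dimension of 16 is arbitrary.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(text_feats, visual_feats):
    """Pool N visual tokens into one vector weighted by text-to-image attention.

    text_feats:   list of T text token vectors, each of length d
    visual_feats: list of N visual token vectors, each of length d
    Returns (pooled, weights): a length-d vector and N weights summing to 1.
    Assumption: raw dot-product attention with no learned projections.
    """
    d = len(visual_feats[0])
    n = len(visual_feats)
    scale = 1.0 / math.sqrt(d)
    weights = [0.0] * n
    for q in text_feats:
        # Scaled dot-product score of this text token against every visual token.
        scores = [scale * sum(qi * vi for qi, vi in zip(q, v)) for v in visual_feats]
        # Average the per-text-token attention distributions.
        for j, a in enumerate(softmax(scores)):
            weights[j] += a / len(text_feats)
    # Weighted sum of visual features gives the single compressed token.
    pooled = [sum(w * v[k] for w, v in zip(weights, visual_feats)) for k in range(d)]
    return pooled, weights


random.seed(0)
text = [[random.gauss(0, 1) for _ in range(16)] for _ in range(4)]
visual = [[random.gauss(0, 1) for _ in range(16)] for _ in range(576)]
pooled, weights = attention_pool(text, visual)
print(len(pooled), len(weights))
```

Here 576 visual tokens collapse to one vector of the same feature dimension, which is the sense in which single-token compression shrinks the visual part of the KV cache; how well such a pooled vector supports the downstream task is what the paper's distillation objective is meant to address.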

Yiling Gao, Hongchen Wei, Zhenzhong Chen • 2026

Related benchmarks

Task	Dataset	Result
Visual Question Answering	VQA v2	Accuracy 95.33	1429
Multimodal Evaluation	MME	--	902
Multimodal Understanding	SEED-Bench	--	571
Referring Expression Comprehension	RefCOCO (testA)	Accuracy 0.1873	351
Multi-modal Evaluation	MME	MME Score 1.84e+3	240
Referring Expression Comprehension	RefCOCO (testB)	Accuracy 37.72	213
Multimodal Reasoning	MMBench	MMBench Accuracy (en) 82.75	65
Multimodal Understanding	SEED	SEED Score 56.66	51
Science Question Answering	SQA	SQA Score 68.76	22
Science Question Answering	ScienceQA	SQA Score 85.48	19

Showing 10 of 12 rows
