PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models

About

Visual Language Models require substantial computational resources for inference due to the additional input tokens needed to represent visual information. However, these visual tokens often contain redundant and unimportant information, resulting in an unnecessarily high number of tokens. To address this, we introduce PACT, a method that reduces inference time and memory usage by pruning irrelevant tokens and merging visually redundant ones at an early layer of the language model. Our approach uses a novel importance metric to identify unimportant tokens without relying on attention scores, making it compatible with FlashAttention. We also propose a novel clustering algorithm, called Distance Bounded Density Peak Clustering, which efficiently clusters visual tokens while constraining the distances between elements within a cluster by a predefined threshold. We demonstrate the effectiveness of PACT through extensive experiments.

Mohamed Dhouib, Davide Buscaldi, Sonia Vanier, Aymen Shabou • 2025
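
The prune-then-merge idea in the abstract can be illustrated with a small sketch. This is not the paper's implementation. The importance scores are taken as given inputs, since the abstract does not define the metric. The clustering is a simplified density-ordered greedy scheme standing in for Distance Bounded Density Peak Clustering. The names `prune`, `bounded_cluster`, `merge`, `keep`, and `max_dist` are hypothetical.

```python
import math

def prune(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens, preserving their original order.
    Assumption: `scores` comes from some importance metric defined elsewhere."""
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])
    return [tokens[i] for i in kept]

def bounded_cluster(tokens, max_dist):
    """Greedy clustering in decreasing density order (a simplification).
    Every token ends up within `max_dist` of its cluster center."""
    n = len(tokens)
    dist = [[math.dist(a, b) for b in tokens] for a in tokens]
    # Density = number of tokens (including itself) within max_dist.
    density = [sum(1 for d in row if d <= max_dist) for row in dist]
    order = sorted(range(n), key=lambda i: density[i], reverse=True)
    centers, assign = [], [None] * n
    for i in order:
        near = [c for c in centers if dist[i][c] <= max_dist]
        if near:
            # Attach to the closest existing center that respects the bound.
            assign[i] = min(near, key=lambda c: dist[i][c])
        else:
            # No center is close enough, so this token starts a new cluster.
            centers.append(i)
            assign[i] = i
    return centers, assign

def merge(tokens, centers, assign):
    """Replace each cluster with the mean of its member tokens."""
    merged = []
    for c in centers:
        members = [tokens[i] for i in range(len(tokens)) if assign[i] == c]
        merged.append([sum(col) / len(members) for col in zip(*members)])
    return merged

# Toy example: 2-D vectors stand in for visual token embeddings.
tokens = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [9.0, 9.0]]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.01]
kept = prune(tokens, scores, keep=5)
centers, assign = bounded_cluster(kept, max_dist=0.5)
reduced = merge(kept, centers, assign)
```

In this toy run the low-scoring outlier is pruned and the remaining five vectors collapse into two merged tokens. Because the sketch never reads attention weights, it fits the abstract's point about staying compatible with FlashAttention, though the paper's actual metric and clustering rule may differ substantially.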

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy 82.4	2019
Visual Question Answering	VizWiz	Accuracy 52.8	1820
Visual Question Answering	TextVQA	Accuracy 78.56	1453
Multimodal Understanding	MMBench	Accuracy 72.2	847
Science Question Answering	ScienceQA	Accuracy 78.3	791
Multimodal Evaluation	MME	Score 2.01e+3	727
Video Understanding	MVBench	Accuracy 75.3	563
Visual Question Answering	ChartQA	Accuracy 76.36	519
Multimodal Understanding	MMStar	Accuracy 54.8	407
Diagram Question Answering	AI2D	AI2D Accuracy 78.4	387

Showing 10 of 48 rows
