
PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models

About

Visual Language Models require substantial computational resources for inference due to the additional input tokens needed to represent visual information. However, these visual tokens often contain redundant and unimportant information, resulting in an unnecessarily high number of tokens. To address this, we introduce PACT, a method that reduces inference time and memory usage by pruning irrelevant tokens and merging visually redundant ones at an early layer of the language model. Our approach uses a novel importance metric to identify unimportant tokens without relying on attention scores, making it compatible with FlashAttention. We also propose a novel clustering algorithm, called Distance Bounded Density Peak Clustering, which efficiently clusters visual tokens while constraining the distances between elements within a cluster by a predefined threshold. We demonstrate the effectiveness of PACT through extensive experiments.

Mohamed Dhouib, Davide Buscaldi, Sonia Vanier, Aymen Shabou • 2025
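
The prune-then-merge idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the importance metric or the steps of Distance Bounded Density Peak Clustering, so the code below assumes caller-supplied importance scores and substitutes a simple greedy, density-ordered clustering in which every member lies within a fixed distance of its cluster center. The function names (`prune_tokens`, `distance_bounded_clusters`, `merge_tokens`) and the parameters (`keep_ratio`, `threshold`, `density_radius`) are hypothetical.

```python
import numpy as np


def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring tokens, preserving their original order.

    `scores` is supplied by the caller; the paper's importance metric
    is not reproduced here (assumption).
    """
    k = max(1, int(round(keep_ratio * len(tokens))))
    keep = np.sort(np.argsort(-scores, kind="stable")[:k])
    return tokens[keep], keep


def distance_bounded_clusters(tokens, threshold, density_radius):
    """Greedy density-ordered clustering (illustrative stand-in).

    Tokens are visited from densest to sparsest. Each unassigned token
    becomes a center and claims every unassigned token within
    `threshold`, so no member is farther than `threshold` from its center.
    """
    n = tokens.shape[0]
    # Pairwise Euclidean distances, shape (n, n).
    dist = np.linalg.norm(tokens[:, None, :] - tokens[None, :, :], axis=-1)
    # Local density: number of other tokens closer than density_radius.
    density = (dist < density_radius).sum(axis=1) - 1
    order = np.argsort(-density, kind="stable")

    labels = np.full(n, -1)
    centers = []
    for i in order:
        if labels[i] != -1:
            continue  # already claimed by a denser center
        members = (dist[i] <= threshold) & (labels == -1)
        labels[members] = len(centers)
        centers.append(i)
    return labels, np.array(centers)


def merge_tokens(tokens, labels):
    """Replace each cluster by the mean of its member tokens."""
    k = int(labels.max()) + 1
    merged = np.zeros((k, tokens.shape[1]))
    np.add.at(merged, labels, tokens)
    counts = np.bincount(labels, minlength=k)
    return merged / counts[:, None]
```

In this sketch a token joins the first (densest) center that claims it, which is not necessarily its nearest center. The distance bound still holds by construction, which is the property the abstract emphasizes.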

Related benchmarks

Task                                Dataset         Result                  Rank
Object Hallucination Evaluation     POPE            --                      1455
Visual Question Answering           TextVQA         Accuracy 78.56          1285
Video Understanding                 MVBench         Accuracy 75.3           425
Visual Question Answering           ChartQA         Accuracy 76.36          371
Multimodal Understanding            MMStar          Accuracy 54.8           324
Diagram Question Answering          AI2D            AI2D Accuracy 78.4      232
Video Understanding                 VideoMME        --                      222
Video Understanding                 MLVU            Score 61.1              221
Video Understanding                 EgoSchema       EgoSchema Score 61      158
Document Visual Question Answering  DocVQA (val)    Accuracy 74             157
Showing 10 of 27 rows
