Filter, Correlate, Compress: Training-Free Token Reduction for MLLM Acceleration

About

The quadratic complexity of Multimodal Large Language Models (MLLMs) with respect to context length poses significant computational and memory challenges, hindering their real-world deployment. In this paper, we devise a "filter-correlate-compress" framework that accelerates MLLMs by systematically shortening the multimodal context during prefilling. The framework first implements FiCoCo-V, a training-free method operating within the vision encoder. It employs a redundancy-based token discard mechanism that uses a novel integrated metric to accurately filter out redundant visual tokens. To mitigate information loss, the framework introduces a correlation-based information recycling mechanism: preserved tokens selectively recycle information from correlated discarded tokens via a self-preserving compression that prevents the dilution of their own core content. The FiCoCo-L variant further leverages task-aware textual priors to perform token reduction directly within the LLM decoder. Extensive experiments demonstrate that the FiCoCo series effectively accelerates a range of MLLMs, achieving up to a 14.7x reduction in FLOPs while retaining 93.6% of performance. Our methods consistently outperform state-of-the-art training-free approaches, demonstrating effectiveness and generalizability across model architectures, sizes, and tasks without retraining. Code: https://github.com/kawhiiiileo/FiCoCo
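For intuition, below is a minimal, hypothetical PyTorch sketch of a filter-correlate-compress style reduction step. It is not the authors' implementation: the attention-based redundancy score, the cosine-similarity correlation, and all names (ficoco_like_reduce, keep_ratio, recycle_weight) are illustrative stand-ins for the integrated metric and recycling mechanism described in the abstract.

```python
import torch
import torch.nn.functional as F

def ficoco_like_reduce(tokens: torch.Tensor,
                       attn: torch.Tensor,
                       keep_ratio: float = 0.5,
                       recycle_weight: float = 0.4) -> torch.Tensor:
    """Reduce (N, D) visual tokens to roughly (N * keep_ratio, D)
    using an (N, N) attention map from the vision encoder."""
    n = tokens.size(0)
    n_keep = max(1, int(n * keep_ratio))

    # Filter: treat tokens that receive little attention from the other
    # tokens as redundant (a stand-in for the paper's integrated metric).
    scores = attn.sum(dim=0)                 # (N,) attention received
    order = scores.argsort(descending=True)
    keep_idx, drop_idx = order[:n_keep], order[n_keep:]
    kept, dropped = tokens[keep_idx], tokens[drop_idx]

    # Correlate: match each discarded token to its most similar preserved
    # token (cosine similarity as a stand-in correlation measure).
    sim = F.normalize(dropped, dim=-1) @ F.normalize(kept, dim=-1).T
    assign = sim.argmax(dim=-1)              # (N - n_keep,)

    # Compress: average the discarded tokens assigned to each kept token,
    # then blend with a self-preserving weight so the kept token's own
    # content dominates and is not diluted by the recycled information.
    recycled = torch.zeros_like(kept)
    counts = torch.zeros(n_keep, device=tokens.device, dtype=tokens.dtype)
    recycled.index_add_(0, assign, dropped)
    counts.index_add_(0, assign, torch.ones_like(assign, dtype=tokens.dtype))
    has = counts > 0
    merged = kept.clone()
    merged[has] = (1 - recycle_weight) * kept[has] \
        + recycle_weight * recycled[has] / counts[has].unsqueeze(-1)
    return merged

# Example: 576 ViT patch tokens with a random attention map.
tokens = torch.randn(576, 1024)
attn = torch.rand(576, 576).softmax(dim=-1)
reduced = ficoco_like_reduce(tokens, attn, keep_ratio=0.25)
print(reduced.shape)  # torch.Size([144, 1024])
```

Note the self-preserving blend: each surviving token keeps a fixed (1 - recycle_weight) share of its own content regardless of how many discarded tokens are merged into it, which mirrors the abstract's point about preventing dilution of the preserved tokens' core content.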

Yuhang Han, Xuyang Liu, Zihan Zhang, Pengxiang Ding, Junjie Chen, Donglin Wang, Honggang Chen, Qingsen Yan, Siteng Huang • 2024

Related benchmarks

Task                                  Dataset              Metric    Result  Rank
Object Hallucination Evaluation       POPE                 Accuracy  82.1    1455
Visual Question Answering             VQA v2               Accuracy  69.7    1362
Visual Question Answering             GQA                  --        --      1249
Text-based Visual Question Answering  TextVQA              Accuracy  55.7    807
Visual Question Answering             GQA                  Accuracy  53.2    505
Multimodal Understanding              MMBench CN           Accuracy  53.3    174
Multimodal Understanding              MMBench (MMB)        Accuracy  61.5    141
Science Question Answering            ScienceQA (SQA-IMG)  Accuracy  69.5    139