Improving Visual Token Reduction via Rectifying Distortions for Efficient Multimodal LLM Inference

About

Recent advancements in Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision-language tasks, yet the quadratic computational complexity arising from the vast number of visual tokens incurs significant memory and latency bottlenecks. While visual token reduction (VTR) strategies have been explored to mitigate this burden, existing methods overlook the positional and attentional consistency between the full and reduced sequences, resulting in a distorted representation. To this end, we propose RESTORE, a novel VTR framework that rectifies the positional and attentional distortions while maintaining efficiency. Specifically, we present a simple yet effective calibration method that restores lost visual attention by augmenting attention weights based on relative distances. We also introduce a distinctive anchor selection for token merging to mitigate information loss during feature averaging. Experimental results on multiple benchmarks demonstrate that our method consistently improves the accuracy of various reduction methods, achieving state-of-the-art performance while maintaining computational efficiency. The project page is available at https://cvlab.yonsei.ac.kr/projects/RESTORE
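
For orientation, a generic prune-and-merge token reduction step might look like the sketch below. It is not RESTORE's calibration or anchor selection, which the abstract does not spell out. The function names, the per-token importance score, and the cosine-similarity assignment are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def reduce_tokens(tokens, scores, keep):
    """Hypothetical helper: keep the `keep` highest-scoring tokens as anchors,
    then average every dropped token into its most similar anchor."""
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # Sort anchor indices so the reduced sequence preserves the original order.
    anchor_ids = sorted(order[:keep])
    groups = {a: [tokens[a]] for a in anchor_ids}
    for i in order[keep:]:
        best = max(anchor_ids, key=lambda a: cosine(tokens[i], tokens[a]))
        groups[best].append(tokens[i])
    # Feature averaging: each merged token is the mean of its group.
    merged = [
        [sum(col) / len(groups[a]) for col in zip(*groups[a])]
        for a in anchor_ids
    ]
    return anchor_ids, merged


# Toy input: four 2-D "visual tokens" with assumed importance scores.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
scores = [0.4, 0.1, 0.3, 0.2]
anchor_ids, merged = reduce_tokens(tokens, scores, keep=2)
```

Returning `anchor_ids` alongside the merged features keeps each surviving token's original position available, which is the kind of information a position-aware method would need after reduction.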

Hyeonwoo Cho, Donghyeon Baek, Yewon Kim, Bumsub Ham • 2026

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy 86.6	2056
Text-based Visual Question Answering	TextVQA	Accuracy 57.2	984
Multimodal Understanding	MMBench	Accuracy 63.7	887
Multimodal Understanding	SEED-Bench	Accuracy 58.2	571
Visual Question Answering	VQA v2	Accuracy 77.6	347
Scientific Question Answering	ScienceQA image	Accuracy 69.6	281
Visual Question Answering	GQA	Accuracy 61	218
Multimodal Evaluation	MME	MME Score 1.82e+3	179
Text-based Visual Question Answering	TextVQA	VQAText Score 56.1	35
Fine-grained Visual Perception	OCRBench (test)	OCRBench Score 301	24