iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models

About

Recent methods have made notable progress in accelerating Large Vision-Language Models (LVLMs) by exploiting the inherent redundancy in visual inputs. Most existing approaches, however, focus narrowly on reducing image tokens before or within the Large Language Model (LLM) stage to lower computational cost. This overlooks other major bottlenecks, particularly the image encoder, which itself requires substantial computation. As a result, these methods fall short of achieving true end-to-end acceleration. Importantly, the image encoder is the primary contributor of input tokens to the LLM. Thus, reducing visual redundancy at the encoder stage not only speeds up the encoder itself but also significantly lightens the workload for the subsequent LLM. Motivated by this, we investigate how to jointly optimize the image encoder, the LLM, and other LVLM components for comprehensive acceleration. To mitigate the risk of performance degradation from token reduction, we propose a novel token merging strategy that recycles useful information from otherwise discarded tokens. Our approach, iLLaVA, delivers consistent improvements across both image and video understanding tasks, achieving up to a 2x throughput boost and a 4x reduction in prefilling time. Notably, iLLaVA enables a larger model (e.g., InternVL-2.5 26B) to surpass a smaller counterpart (e.g., InternVL-2.5 8B) in both accuracy and efficiency. Extensive comparisons with state-of-the-art token pruning and merging techniques demonstrate the clear superiority of our method. Finally, we provide detailed visualizations of iLLaVA's merging steps, offering deeper insights into how different LVLM components contribute to efficient computation.
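The abstract does not spell out the exact merging rule, but the core idea, keeping the most important visual tokens while folding the information from discarded tokens into their nearest kept neighbors instead of throwing it away, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name `merge_tokens`, the importance `scores` input (e.g., attention received from a [CLS] token), and the cosine-similarity matching are all assumptions for the example.

```python
import numpy as np

def merge_tokens(tokens: np.ndarray, scores: np.ndarray, keep_ratio: float = 1/3) -> np.ndarray:
    """Illustrative token reduction with recycling (not iLLaVA's exact rule).

    tokens: [N, D] visual token embeddings
    scores: [N] importance scores (assumed, e.g. attention from a summary token)
    Returns [k, D] kept tokens, each a running mean over itself and the
    discarded tokens assigned to it.
    """
    n, _ = tokens.shape
    k = max(1, int(n * keep_ratio))
    order = np.argsort(-scores)            # most important first
    keep_idx, drop_idx = order[:k], order[k:]
    kept = tokens[keep_idx].copy()
    if drop_idx.size == 0:
        return kept

    # Cosine similarity between each discarded token and every kept token
    a = tokens[drop_idx] / np.linalg.norm(tokens[drop_idx], axis=1, keepdims=True)
    b = kept / np.linalg.norm(kept, axis=1, keepdims=True)
    target = (a @ b.T).argmax(axis=1)      # nearest kept token per discarded token

    # Fold each discarded token into its target via a running mean,
    # so its information is recycled rather than dropped
    counts = np.ones(k)
    for t, v in zip(target, tokens[drop_idx]):
        kept[t] = (kept[t] * counts[t] + v) / (counts[t] + 1)
        counts[t] += 1
    return kept
```

With `keep_ratio=1/3`, a 576-token image grid would shrink to 192 tokens entering the LLM, which is where the prefill savings come from; applying the same reduction inside the encoder is what distinguishes end-to-end schemes from LLM-only pruning.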

Lianyu Hu, Liqing Gao, Fanhua Shang, Liang Wan, Wei Feng• 2024

Related benchmarks

| Task                      | Dataset          | Metric                  | Result | Rank |
|---------------------------|------------------|-------------------------|--------|------|
| Video Understanding       | MVBench          | Accuracy                | 68.9   | 356  |
| Multimodal Understanding  | MMMU (val)       | MMMU Score              | 58.2   | 152  |
| Multimodal Understanding  | MMBench EN v1.1  | Accuracy                | 81.7   | 63   |
| Video Understanding       | VideoMME         | Accuracy (No Subtitles) | 63     | 60   |
| Multimodal Understanding  | MMBench EN       | Overall Score           | 83     | 55   |
| Video Understanding       | EgoSchema (test) | Accuracy                | 63.3   | 55   |
| Multimodal Understanding  | MME              | Sum Score               | 2330   | 39   |
| Multimodal Understanding  | MuirBench        | Score                   | 59.1   | 16   |
| Multimodal Understanding  | MMStar (test)    | Score                   | 62.8   | 16   |
| Visual Question Answering | MMVet (test)     | Score                   | 66     | 16   |

Showing 10 of 15 rows.
