
iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models

About

Recent methods have made notable progress in accelerating Large Vision-Language Models (LVLMs) by exploiting the inherent redundancy in visual inputs. Most existing approaches, however, focus narrowly on reducing image tokens before or within the Large Language Model (LLM) stage to lower computational cost. This overlooks other major bottlenecks, particularly the image encoder, which itself requires substantial computation. As a result, these methods fall short of true end-to-end acceleration. Importantly, the image encoder is the primary source of input tokens to the LLM, so reducing visual redundancy at the encoder stage not only speeds up the encoder itself but also significantly lightens the workload of the subsequent LLM. Motivated by this, we investigate how to jointly optimize the image encoder and the LLM, along with other LVLM components, for comprehensive acceleration. To mitigate the risk of performance degradation from token reduction, we propose a novel token merging strategy that recycles useful information from tokens that would otherwise be discarded. Our approach, iLLaVA, delivers consistent improvements across both image and video understanding tasks, achieving up to a 2× throughput boost and a 4× reduction in prefilling time. Notably, iLLaVA enables a larger model (e.g., InternVL-2.5 26B) to surpass a smaller counterpart (e.g., InternVL-2.5 8B) in both accuracy and efficiency. Extensive comparisons with state-of-the-art token pruning and merging techniques demonstrate the clear superiority of our method. Finally, we provide detailed visualizations of the merging steps of iLLaVA, offering deeper insight into how different LVLM components contribute to efficient computation.
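
The idea of recycling information from discarded tokens can be illustrated with a small sketch. The abstract does not specify how tokens are scored or how the merge is weighted, so the following is an assumption-laden illustration and not the authors' implementation: it assumes each token already has an importance score, keeps the highest-scoring tokens, and folds every dropped token into its most similar kept token (by cosine similarity) with an unweighted average. The function names `cosine` and `merge_tokens` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def merge_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens and merge the rest into them.

    Assumed inputs (not taken from the paper):
      tokens: list of feature vectors, one per visual token
      scores: one importance score per token (e.g. an attention weight)
      keep:   number of tokens to retain
    Returns (merged_tokens, kept_indices), with kept tokens in original order.
    """
    n = len(tokens)
    if len(scores) != n:
        raise ValueError("tokens and scores must have the same length")
    if not 0 < keep <= n:
        raise ValueError("keep must be between 1 and len(tokens)")

    # Rank tokens by descending score; sorted() is stable, so ties keep input order.
    order = sorted(range(n), key=lambda i: -scores[i])
    kept_idx = sorted(order[:keep])
    dropped_idx = order[keep:]

    # Running sums and counts so each kept token becomes the mean of its group.
    sums = [[float(x) for x in tokens[i]] for i in kept_idx]
    counts = [1] * keep

    for j in dropped_idx:
        # Assign the dropped token to the most similar kept token.
        best = max(range(keep), key=lambda k: cosine(tokens[j], tokens[kept_idx[k]]))
        sums[best] = [s + x for s, x in zip(sums[best], tokens[j])]
        counts[best] += 1

    merged = [[s / c for s in row] for row, c in zip(sums, counts)]
    return merged, kept_idx
```

Calling `merge_tokens(tokens, scores, keep)` with `keep` set to roughly a third of `len(tokens)` mirrors the reduction ratio suggested by the title; plain pruning would correspond to returning the kept tokens unchanged and ignoring the dropped ones.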

Lianyu Hu, Liqing Gao, Fanhua Shang, Liang Wan, Wei Feng • 2024

Related benchmarks

Task                                    Dataset          Result                       Rank
Visual Question Answering               TextVQA          Accuracy: 76.11              1453
Video Understanding                     MVBench          Accuracy: 68.9               563
Visual Question Answering               ChartQA          Accuracy: 64.68              519
Optical Character Recognition           OCRBench         Score: 752                   433
Multimodal Understanding                MMMU (val)       MMMU Score: 58.2             199
Information Visual Question Answering   InfoVQA          Accuracy: 71.09              110
Mathematical Visual Question Answering  MathVista        Accuracy: 54.9               87
Multimodal Understanding                MMBench EN v1.1  Accuracy: 81.7               63
Video Understanding                     VideoMME         Accuracy (No Subtitles): 63  60
Multi-modal Understanding               MMBench EN       Overall Score: 83            55
Showing 10 of 26 rows
