iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models

About

Recent methods have made notable progress in accelerating Large Vision-Language Models (LVLMs) by exploiting the inherent redundancy in visual inputs. Most existing approaches, however, focus narrowly on reducing image tokens before or within the Large Language Model (LLM) stage to lower computational cost. This overlooks other major bottlenecks, particularly the image encoder, which itself requires substantial computation. As a result, these methods fall short of achieving true end-to-end acceleration. Importantly, the image encoder is the primary contributor of input tokens to the LLM. Thus, reducing visual redundancy at the encoder stage not only speeds up the encoder itself but also significantly lightens the workload for the subsequent LLM. Motivated by this, we investigate how to jointly optimize the image encoder, the LLM, and other LVLM components for comprehensive acceleration. To mitigate the risk of performance degradation from token reduction, we propose a novel token merging strategy that recycles useful information from otherwise discarded tokens. Our approach, iLLaVA, delivers consistent improvements across both image and video understanding tasks, achieving up to a 2x throughput boost and a 4x reduction in prefilling time. Notably, iLLaVA enables a larger model (e.g., InternVL-2.5 26B) to surpass a smaller counterpart (e.g., InternVL-2.5 8B) in both accuracy and efficiency. Extensive comparisons with state-of-the-art token pruning and merging techniques demonstrate the clear superiority of our method. Finally, we provide detailed visualizations of iLLaVA's merging steps, offering deeper insights into how different LVLM components contribute to efficient computation.
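The abstract does not spell out the exact merging rule, but the core idea, keeping the most important visual tokens while folding the information from discarded tokens into their nearest kept neighbors instead of throwing it away, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name `merge_tokens`, the importance `scores` input (e.g., attention received from a [CLS] token), and the cosine-similarity matching are all assumptions for the example.

```python
import numpy as np

def merge_tokens(tokens: np.ndarray, scores: np.ndarray, keep_ratio: float = 1/3) -> np.ndarray:
    """Illustrative token reduction with recycling (not iLLaVA's exact rule).

    tokens: [N, D] visual token embeddings
    scores: [N] importance scores (assumed, e.g. attention from a summary token)
    Returns [k, D] kept tokens, each a running mean over itself and the
    discarded tokens assigned to it.
    """
    n, _ = tokens.shape
    k = max(1, int(n * keep_ratio))
    order = np.argsort(-scores)            # most important first
    keep_idx, drop_idx = order[:k], order[k:]
    kept = tokens[keep_idx].copy()
    if drop_idx.size == 0:
        return kept

    # Cosine similarity between each discarded token and every kept token
    a = tokens[drop_idx] / np.linalg.norm(tokens[drop_idx], axis=1, keepdims=True)
    b = kept / np.linalg.norm(kept, axis=1, keepdims=True)
    target = (a @ b.T).argmax(axis=1)      # nearest kept token per discarded token

    # Fold each discarded token into its target via a running mean,
    # so its information is recycled rather than dropped
    counts = np.ones(k)
    for t, v in zip(target, tokens[drop_idx]):
        kept[t] = (kept[t] * counts[t] + v) / (counts[t] + 1)
        counts[t] += 1
    return kept
```

With `keep_ratio=1/3`, a 576-token image grid would shrink to 192 tokens entering the LLM, which is where the prefill savings come from; applying the same reduction inside the encoder is what distinguishes end-to-end schemes from LLM-only pruning.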

Lianyu Hu, Liqing Gao, Fanhua Shang, Liang Wan, Wei Feng• 2024

Related benchmarks

| Task                      | Dataset          | Metric                  | Result | Rank |
|---------------------------|------------------|-------------------------|--------|------|
| Video Understanding       | MVBench          | Accuracy                | 68.9   | 356  |
| Multimodal Understanding  | MMMU (val)       | MMMU Score              | 58.2   | 152  |
| Multimodal Understanding  | MMBench EN v1.1  | Accuracy                | 81.7   | 63   |
| Video Understanding       | VideoMME         | Accuracy (No Subtitles) | 63     | 60   |
| Multimodal Understanding  | MMBench EN       | Overall Score           | 83     | 55   |
| Video Understanding       | EgoSchema (test) | Accuracy                | 63.3   | 55   |
| Multimodal Understanding  | MME              | Sum Score               | 2330   | 39   |
| Multimodal Understanding  | MuirBench        | Score                   | 59.1   | 16   |
| Multimodal Understanding  | MMStar (test)    | Score                   | 62.8   | 16   |
| Visual Question Answering | MMVet (test)     | Score                   | 66     | 16   |

Showing 10 of 15 rows.
