Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features

About

Visual token reduction lowers the inference cost caused by extensive image features in large vision-language models (LVLMs). Unlike related studies that prune tokens in self-attention-only LVLMs, our work addresses cross-attention-based models, which achieve superior performance. We find that the key-value (KV) cache size for image tokens in cross-attention layers significantly exceeds that of text tokens in self-attention layers, posing a major compute bottleneck. To mitigate this issue, we exploit the sparsity of cross-attention maps to selectively prune redundant visual features. Our Trimmed Llama effectively reduces KV cache demands without requiring additional training. With visual features reduced by 50%, our model lowers inference latency and memory usage while maintaining benchmark parity.
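
The pruning idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how attention weights are aggregated or at which layer pruning happens, so the summation over heads and text queries, the function names `select_visual_tokens` and `prune_kv`, and the `keep_ratio` parameter are all assumptions made for illustration. The sketch scores each image token by the total cross-attention it receives, keeps the top half, and drops the remaining entries from the cached keys and values.

```python
import math
from typing import List, Sequence, Tuple


def select_visual_tokens(
    attn: Sequence[Sequence[Sequence[float]]], keep_ratio: float = 0.5
) -> List[int]:
    """Return indices of image tokens to keep, in their original order.

    attn[h][q][t] is the cross-attention weight that text query q gives to
    image token t in head h. Summing over heads and queries is an assumed
    importance score, chosen here only for illustration.
    """
    num_tokens = len(attn[0][0])
    scores = [0.0] * num_tokens
    for head in attn:
        for query_row in head:
            for t, weight in enumerate(query_row):
                scores[t] += weight

    # Keep at least one token; ceil so small inputs are not pruned to zero.
    num_keep = max(1, math.ceil(num_tokens * keep_ratio))
    ranked = sorted(range(num_tokens), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:num_keep])


def prune_kv(keys: Sequence, values: Sequence, keep: Sequence[int]) -> Tuple[list, list]:
    """Drop cached key/value entries for image tokens that were not selected."""
    return [keys[t] for t in keep], [values[t] for t in keep]


# Toy example: one head, two text queries, four image tokens.
attn = [[[0.70, 0.05, 0.20, 0.05],
         [0.60, 0.10, 0.25, 0.05]]]
keep = select_visual_tokens(attn, keep_ratio=0.5)
pruned_keys, pruned_values = prune_kv(["k0", "k1", "k2", "k3"],
                                      ["v0", "v1", "v2", "v3"], keep)
```

In this toy input most attention mass falls on image tokens 0 and 2, so those two survive and the cached keys and values shrink to half their original length. Because the selection only reads attention weights that are already computed, a scheme of this kind needs no additional training, which matches the training-free claim in the abstract.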

Jewon Lee, Ki-Ung Song, Seungmin Yang, Donguk Lim, Jaeyeon Kim, Wooksu Shin, Bo-Kyeong Kim, Yong Jae Lee, Tae-Ho Kim • 2025

Related benchmarks

Task	Dataset	Result
Facial Understanding	FaceFocalDesc	Classification Score: 70.09	18
Multimodal Facial Understanding	FaceFocalDesc (test)	Accuracy (Cls): 63.61	10
Multimodal Comment Generation	HotComment (test)	BLEU-1: 13.82	9
Language Reasoning	DeepAccident-CCoT (val)	Accuracy: 69.8	6
Risk Prediction	DeepAccident-CCoT (val)	Accuracy: 75.6	6
Trajectory Planning	DeepAccident-CCoT (val)	L2 Distance @ 1s (m): 0.7	6
Image Captioning	FaceFocalDesc	BS-P: 53.63	5

