Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features
About
Visual token reduction lowers the inference cost caused by extensive image features in large vision-language models (LVLMs). Unlike related studies that prune tokens in self-attention-only LVLMs, our work addresses cross-attention-based models, which achieve superior performance. We identify that the key-value (KV) cache size for image tokens in cross-attention layers significantly exceeds that of text tokens in self-attention layers, posing a major compute bottleneck. To mitigate this issue, we exploit the sparsity of cross-attention maps to selectively prune redundant visual features. Our Trimmed Llama reduces KV cache demands without requiring additional training. With visual features reduced by 50%, our model lowers inference latency and memory usage while achieving benchmark parity.
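The pruning idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `select_visual_tokens` and `prune_kv`, the importance rule (summing the attention each image token receives over all text queries), and the fixed `keep_ratio` are all assumptions made for illustration. The abstract states only that sparse cross-attention maps guide which visual features are removed.

```python
from typing import List, Sequence, Tuple


def select_visual_tokens(
    attn: Sequence[Sequence[float]], keep_ratio: float = 0.5
) -> List[int]:
    """Return sorted indices of the image tokens to keep.

    attn[q][k] is assumed to be the head-averaged cross-attention weight
    from text query q to image token k. The aggregation rule and
    keep_ratio are illustrative assumptions, not the paper's exact criterion.
    """
    num_image_tokens = len(attn[0])
    # Importance of an image token: total attention received from all text queries.
    importance = [sum(row[k] for row in attn) for k in range(num_image_tokens)]
    num_keep = max(1, int(num_image_tokens * keep_ratio))
    ranked = sorted(
        range(num_image_tokens), key=lambda k: importance[k], reverse=True
    )
    return sorted(ranked[:num_keep])


def prune_kv(keys: Sequence, values: Sequence, keep: Sequence[int]) -> Tuple[list, list]:
    """Drop cached key/value entries for image tokens that were not selected."""
    return [keys[i] for i in keep], [values[i] for i in keep]


# Toy example: 2 text queries attending to 4 image tokens.
attn = [
    [0.70, 0.10, 0.15, 0.05],
    [0.60, 0.05, 0.30, 0.05],
]
keep = select_visual_tokens(attn, keep_ratio=0.5)
keys, values = prune_kv(["k0", "k1", "k2", "k3"], ["v0", "v1", "v2", "v3"], keep)
print(keep, keys, values)
```

Because the selection reuses attention weights that the model already computes, a scheme of this kind needs no extra training. In this sketch the cross-attention KV cache for image tokens shrinks in proportion to `keep_ratio`.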
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Facial Understanding | FaceFocalDesc | Classification Score | 70.09 | 18 |
| Multimodal Facial Understanding | FaceFocalDesc (test) | Accuracy (Cls) | 63.61 | 10 |
| Multimodal Comment Generation | HotComment (test) | BLEU-1 | 13.82 | 9 |
| Language Reasoning | DeepAccident-CCoT (val) | Accuracy | 69.8 | 6 |
| Risk Prediction | DeepAccident-CCoT (val) | Accuracy | 75.6 | 6 |
| Trajectory Planning | DeepAccident-CCoT (val) | L2 Distance @ 1s (m) | 0.7 | 6 |
| Image Captioning | FaceFocalDesc | BS-P | 53.63 | 5 |