
CARPE: Context-Aware Image Representation Prioritization via Ensemble for Large Vision-Language Models

About

Large vision-language models (LVLMs) are typically trained with autoregressive language modeling objectives, which align visual representations with the linguistic space. While effective for multimodal reasoning, this alignment can weaken vision-centric capabilities, causing LVLMs to underperform their own base vision encoders on tasks such as image classification. To address this limitation, we propose Context-Aware Image Representation Prioritization via Ensemble (CARPE), a lightweight framework that integrates raw vision features with aligned LLM representations through vision-integration layers and a context-aware ensemble mechanism. This design enables the model to adaptively weight the visual and textual modalities and to capture complementary aspects of image representations. Extensive experiments demonstrate that CARPE improves performance on both image classification and diverse vision-language benchmarks. Our results suggest that modality balancing plays a critical role in multimodal generalization by improving representation utilization within autoregressive LVLMs.
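The abstract does not give implementation details, but the described mechanism — mixing raw vision-encoder features with LLM-aligned visual features under a context-dependent weighting — can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation; the module name, the sigmoid gate, the mean-pooled text context, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ContextAwareEnsemble(nn.Module):
    """Hypothetical sketch of a CARPE-style fusion step: raw vision features
    and LLM-aligned visual features are combined with a gate conditioned on
    the textual context. All names and shapes are illustrative assumptions."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        # "Vision-integration layer": projects raw encoder features into LLM space.
        self.vision_proj = nn.Linear(vision_dim, llm_dim)
        # Context-aware gate: conditioned on (mean-pooled) text-token states.
        self.gate = nn.Sequential(nn.Linear(llm_dim, llm_dim), nn.Sigmoid())

    def forward(self, raw_vision, aligned, context):
        # raw_vision: (B, N, vision_dim)  raw vision-encoder patch features
        # aligned:    (B, N, llm_dim)     LLM-aligned visual representations
        # context:    (B, T, llm_dim)     text-token hidden states
        v = self.vision_proj(raw_vision)                  # (B, N, llm_dim)
        g = self.gate(context.mean(dim=1, keepdim=True))  # (B, 1, llm_dim)
        # Per-dimension, context-dependent mix of the two visual views.
        return g * v + (1.0 - g) * aligned

fuser = ContextAwareEnsemble(vision_dim=1024, llm_dim=4096)
out = fuser(torch.randn(2, 16, 1024),   # raw patch features
            torch.randn(2, 16, 4096),   # aligned features
            torch.randn(2, 8, 4096))    # text context
print(out.shape)  # torch.Size([2, 16, 4096])
```

A gate of this form lets the model fall back on raw vision features when the textual context signals a vision-centric query (e.g. classification), which is one plausible reading of "prioritization via ensemble."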

Donghee Lee, Rui Cai, Zhe Zhao • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Hallucination Evaluation | POPE | – | 1455 |
| Image Classification | Flowers102 | Accuracy 16.7 | 558 |
| Image Classification | Food101 | Accuracy 37.7 | 457 |
| Image Classification | Caltech101 | Accuracy 65.6 | 228 |
| Scientific Question Answering | ScienceQA (image) | Accuracy 68.4 | 184 |
| Multimodal Model Evaluation | MMBench | Accuracy 64.8 | 180 |
| Visual Perception | MMVP | Accuracy 65.0 | 82 |
| Multimodal Model Evaluation | MME | Total Score 1860 | 71 |
| Vision-centric Evaluation | CV-Bench | Accuracy 0.588 | 21 |
| Visual Question Answering | TextVQA | Accuracy 57.4 | 7 |
