
CARPE: Context-Aware Image Representation Prioritization via Ensemble for Large Vision-Language Models

About

Large vision-language models (LVLMs) are typically trained with autoregressive language modeling objectives, which align visual representations with the linguistic space. While effective for multimodal reasoning, this alignment can weaken vision-centric capabilities, causing LVLMs to underperform their own base vision encoders on tasks such as image classification. To address this limitation, we propose Context-Aware Image Representation Prioritization via Ensemble (CARPE), a lightweight framework that integrates raw vision features with aligned LLM representations through vision-integration layers and a context-aware ensemble mechanism. This design enables the model to adaptively weight the visual and textual modalities and to capture complementary aspects of image representations. Extensive experiments demonstrate that CARPE improves performance on both image classification and diverse vision-language benchmarks. Our results suggest that modality balancing plays a critical role in multimodal generalization by improving representation utilization within autoregressive LVLMs.
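The abstract does not give implementation details, but the described mechanism — mixing raw vision-encoder features with LLM-aligned visual features under a context-dependent weighting — can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation; the module name, the sigmoid gate, the mean-pooled text context, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ContextAwareEnsemble(nn.Module):
    """Hypothetical sketch of a CARPE-style fusion step: raw vision features
    and LLM-aligned visual features are combined with a gate conditioned on
    the textual context. All names and shapes are illustrative assumptions."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        # "Vision-integration layer": projects raw encoder features into LLM space.
        self.vision_proj = nn.Linear(vision_dim, llm_dim)
        # Context-aware gate: conditioned on (mean-pooled) text-token states.
        self.gate = nn.Sequential(nn.Linear(llm_dim, llm_dim), nn.Sigmoid())

    def forward(self, raw_vision, aligned, context):
        # raw_vision: (B, N, vision_dim)  raw vision-encoder patch features
        # aligned:    (B, N, llm_dim)     LLM-aligned visual representations
        # context:    (B, T, llm_dim)     text-token hidden states
        v = self.vision_proj(raw_vision)                  # (B, N, llm_dim)
        g = self.gate(context.mean(dim=1, keepdim=True))  # (B, 1, llm_dim)
        # Per-dimension, context-dependent mix of the two visual views.
        return g * v + (1.0 - g) * aligned

fuser = ContextAwareEnsemble(vision_dim=1024, llm_dim=4096)
out = fuser(torch.randn(2, 16, 1024),   # raw patch features
            torch.randn(2, 16, 4096),   # aligned features
            torch.randn(2, 8, 4096))    # text context
print(out.shape)  # torch.Size([2, 16, 4096])
```

A gate of this form lets the model fall back on raw vision features when the textual context signals a vision-centric query (e.g. classification), which is one plausible reading of "prioritization via ensemble."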

Donghee Lee, Rui Cai, Zhe Zhao • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Hallucination Evaluation | POPE | – | 1455 |
| Image Classification | Flowers102 | Accuracy 16.7 | 558 |
| Image Classification | Food101 | Accuracy 37.7 | 457 |
| Image Classification | Caltech101 | Accuracy 65.6 | 228 |
| Scientific Question Answering | ScienceQA (image) | Accuracy 68.4 | 184 |
| Multimodal Model Evaluation | MMBench | Accuracy 64.8 | 180 |
| Visual Perception | MMVP | Accuracy 65.0 | 82 |
| Multimodal Model Evaluation | MME | Total Score 1860 | 71 |
| Vision-centric Evaluation | CV-Bench | Accuracy 0.588 | 21 |
| Visual Question Answering | TextVQA | Accuracy 57.4 | 7 |
