Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models

About

Seeing clearly at high resolution is foundational for Large Multimodal Models (LMMs) and has proven vital for visual perception and reasoning. Existing works usually employ a straightforward resolution-upscaling scheme in which the image is processed by a global branch and a local branch, the latter consisting of sliced image patches resized to the same resolution as the former. Higher resolution therefore requires more local patches, incurring exorbitant computational expense, while the dominance of local image tokens can diminish the global context. In this paper, we examine these problems and propose a new framework together with an elaborate optimization strategy. Specifically, we extract contextual information from the global view using a mixture of adapters, based on the observation that different adapters excel at different tasks. For local patches, learnable query embeddings are introduced to reduce the number of image tokens, and the tokens most relevant to the user question are then selected by a similarity-based selector. Our empirical results demonstrate a "less is more" pattern: using fewer but more informative local image tokens leads to improved performance. A further challenge lies in the training strategy, since simultaneous end-to-end training of the global mining block and the local compression block does not yield optimal results; we therefore advocate an alternating training scheme that ensures balanced learning between global and local aspects. Finally, we introduce a challenging dataset with high demands on image detail to strengthen training of the local compression layer. The proposed method, termed LMM with Sophisticated Tasks, Local image compression, and Mixture of global Experts (SliME), achieves leading performance across various benchmarks with only 2 million training samples.
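
To make the local branch concrete, here is a minimal PyTorch-style sketch of the two ideas the abstract describes: compressing local patch tokens with learnable query embeddings, then keeping only the tokens most similar to the user question. This is not the authors' released implementation; the class and function names (`LocalCompressor`, `select_by_question`), dimensions, and hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch of learnable-query compression + similarity-based
# token selection; names and sizes are assumptions, not the SliME code.
import torch
import torch.nn.functional as F
from torch import nn

class LocalCompressor(nn.Module):
    def __init__(self, dim: int = 1024, num_queries: int = 144):
        super().__init__()
        # Learnable query embeddings that cross-attend to local patch tokens,
        # shrinking them to a fixed, smaller token budget.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patch_tokens, dim)
        q = self.queries.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        compressed, _ = self.attn(q, patch_tokens, patch_tokens)
        return compressed  # (batch, num_queries, dim)

def select_by_question(tokens: torch.Tensor, question_emb: torch.Tensor, k: int):
    """Keep the k tokens most cosine-similar to the pooled question embedding.

    tokens:       (batch, n, dim) compressed local tokens
    question_emb: (batch, dim)    pooled text embedding of the user question
    """
    sim = F.cosine_similarity(tokens, question_emb.unsqueeze(1), dim=-1)  # (batch, n)
    idx = sim.topk(k, dim=-1).indices                                     # (batch, k)
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))

# Usage: compress 576 patch tokens per slice down to 144, then keep the top 64,
# reflecting the "fewer but more informative local tokens" observation.
compressor = LocalCompressor()
patches = torch.randn(2, 576, 1024)
question = torch.randn(2, 1024)
kept = select_by_question(compressor(patches), question, k=64)
print(kept.shape)  # torch.Size([2, 64, 1024])
```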

Yi-Fan Zhang, Qingsong Wen, Chaoyou Fu, Xue Wang, Zhang Zhang, Liang Wang, Rong Jin • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | TextVQA | Accuracy | 64.7 | 1285 |
| Visual Question Answering | GQA | Accuracy | 63.9 | 1249 |
| Multimodal Evaluation | MME | -- | -- | 658 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 35.4 | 531 |
| Multimodal Capability Evaluation | MM-Vet | Score | 37.4 | 345 |
| Visual Question Answering | TextVQA (val) | VQA Score | 64.4 | 343 |
| Multimodal Model Evaluation | MMBench | Accuracy | 75 | 180 |
| Visual Question Answering | VQAv2 | Accuracy | 80.3 | 177 |
| Hallucination Evaluation | POPE | Accuracy | 85.4 | 153 |
| Multimodal Evaluation | MMBench CN | Accuracy | 71.8 | 83 |

Showing 10 of 20 rows.
