
InfMLLM: A Unified Framework for Visual-Language Tasks

About

Large language models (LLMs) have proven remarkably versatile across a comprehensive range of language-centric applications. To extend LLMs to a broader spectrum of modal inputs, multimodal large language models (MLLMs) have attracted growing interest. This work focuses on enabling LLMs to tackle more vision-language tasks, particularly image captioning, visual question answering (VQA), and visual grounding. To this end, we implement a three-stage training scheme: lightweight alignment pretraining, then moderate-weight multitask hybrid training, and finally LLM fine-tuning to improve instruction-following capability. GPU memory requirements increase progressively across these stages. To effectively manage the number of visual embeddings passed to the LLM while preserving their positional information, we introduce a straightforward visual adapter module dubbed pool-adapter. Our experiments demonstrate that preserving the positional information of visual embeddings through the pool-adapter is particularly beneficial for tasks like visual grounding. We name our proposed approach InfMLLM and evaluate it extensively on various benchmark datasets. Our results demonstrate that InfMLLM achieves either state-of-the-art (SOTA) performance or performance comparable to recent MLLMs. The code and model will be made open-source at https://github.com/mightyzau/InfMLLM.
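The abstract describes the pool-adapter as reducing the number of visual embeddings passed to the LLM while preserving their positional information. Below is a minimal NumPy sketch of one plausible reading of that idea: pooling over the 2D patch grid of a vision encoder, so that the reduced token sequence keeps the patches' spatial arrangement. The function name, pooling choice (average pooling), and shapes are illustrative assumptions, not the paper's exact implementation; see the linked repository for the real module.

```python
import numpy as np

def pool_adapter(patch_embeds, out_hw):
    """Illustrative sketch (assumption, not the paper's exact code):
    reduce the number of visual embeddings by 2D average pooling over
    the patch grid, so relative positions survive the reduction.

    patch_embeds: (H, W, D) array of visual patch embeddings.
    out_hw: (h, w) target grid, with H % h == 0 and W % w == 0.
    Returns an (h * w, D) array, flattened row-major, i.e. the
    shortened token sequence handed to the LLM.
    """
    H, W, D = patch_embeds.shape
    h, w = out_hw
    assert H % h == 0 and W % w == 0, "grid must divide evenly"
    # Split each spatial axis into (output cells x window) and
    # average within each pooling window.
    x = patch_embeds.reshape(h, H // h, w, W // w, D)
    x = x.mean(axis=(1, 3))
    # Flatten row-major so token order encodes grid position.
    return x.reshape(h * w, D)

# Example: a 24x24 ViT patch grid with 1024-dim embeddings is
# compressed to 144 visual tokens instead of 576.
grid = np.random.randn(24, 24, 1024)
tokens = pool_adapter(grid, (12, 12))
print(tokens.shape)  # (144, 1024)
```

Because the pooled tokens are emitted in row-major grid order, a token's index still encodes where its receptive field sits in the image, which is one way positional information can be preserved for localization tasks like visual grounding.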

Qiang Zhou, Zhibin Wang, Wei Chu, Yinghui Xu, Hao Li, Yuan Qi • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | – | – | 1455 |
| Visual Question Answering | VQA v2 | Accuracy | 82.3 | 1362 |
| Visual Question Answering | TextVQA | Accuracy | 68.02 | 1285 |
| Visual Question Answering | GQA | Accuracy | 65.1 | 1249 |
| Multimodal Evaluation | MME | Score | 1490 | 658 |
| Multimodal Understanding | SEED-Bench | Accuracy | 61.7 | 343 |
| Science Question Answering | ScienceQA IMG | Accuracy | 68.7 | 294 |
| Visual Question Answering | OK-VQA | Accuracy | 61.33 | 260 |
| Visual Grounding | RefCOCO+ (val) | – | – | 212 |
| Visual Grounding | RefCOCO+ (testA) | Accuracy | 92.36 | 206 |

Showing 10 of 20 rows.
