
HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models

About

Recent advances indicate that scaling up Multimodal Large Language Models (MLLMs) effectively enhances performance on downstream multimodal tasks. The prevailing MLLM paradigm, e.g., LLaVA, transforms visual features into text-like tokens with a static vision-language mapper, enabling a static LLM to comprehend visual information through visual instruction tuning. ("Static" tuning here means the trained model's parameters are fixed and shared across all inputs.) Although promising, this static strategy may constrain performance across different downstream multimodal tasks. In light of this, we introduce HyperLLaVA, which adaptively tunes the projector and LLM parameters in conjunction with a dynamic visual expert and a dynamic language expert, respectively. These experts are derived from HyperNetworks, which generate adaptive parameter shifts from visual and language guidance, enabling dynamic projector and LLM modeling across the two-stage training. Our experiments demonstrate that our solution significantly surpasses LLaVA on existing MLLM benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench. Our project is available at https://github.com/DCDmllm/HyperLLaVA.
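To make the mechanism concrete, the sketch below shows the core idea of a hypernetwork-driven dynamic projector: a small network maps pooled visual guidance to a parameter shift that is added to a static vision-language mapper for that input. All names, dimensions, and the pooling/architecture choices here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Small dimensions for illustration (hypothetical, not the paper's sizes)
VIS, TXT, HID = 32, 16, 64

# Static projector weights (the LLaVA-style vision-language mapper)
W_base = rng.normal(scale=0.02, size=(TXT, VIS))

# Hypernetwork weights: guidance vector -> flattened weight shift
W_h1 = rng.normal(scale=0.02, size=(HID, VIS))
W_h2 = rng.normal(scale=0.02, size=(TXT * VIS, HID))

def dynamic_project(visual_tokens):
    """Map visual tokens (N, VIS) to text-like tokens (N, TXT).

    A pooled visual guidance vector conditions the hypernetwork,
    which emits a weight shift added to the static projector,
    so the effective projection differs per input."""
    guidance = visual_tokens.mean(axis=0)                 # (VIS,) visual guidance
    delta = (W_h2 @ relu(W_h1 @ guidance)).reshape(TXT, VIS)
    return visual_tokens @ (W_base + delta).T             # (N, TXT)

tokens = rng.normal(size=(5, VIS))
print(dynamic_project(tokens).shape)  # (5, 16)
```

The same pattern would apply to the language expert, with language-side guidance producing shifts for (a subset of) LLM parameters rather than the projector.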

Wenqiao Zhang, Tianwei Lin, Jiang Liu, Fangxun Shu, Haoyuan Li, Lei Zhang, He Wanggui, Hao Zhou, Zheqi Lv, Hao Jiang, Juncheng Li, Siliang Tang, Yueting Zhuang · 2024

Related benchmarks

Task                             Dataset         Metric         Result   Rank
Visual Question Answering        VizWiz          Accuracy       53.4     1525
Object Hallucination Evaluation  POPE            Accuracy       86.3     1455
Visual Question Answering        VQA v2          Accuracy       79.1     1362
Visual Question Answering        TextVQA         Accuracy       58.5     1285
Visual Question Answering        GQA             Accuracy       62.7     1249
Multimodal Evaluation            MME             Score          1490     658
Multimodal Understanding         MM-Vet          MM-Vet Score   52.1     531
Multimodal Reasoning             MM-Vet          MM-Vet Score   31       431
Multimodal Understanding         SEED-Bench      Accuracy       61.4     343
Science Question Answering       ScienceQA IMG   Accuracy       70.4     294

(Showing 10 of 20 rows.)

Other info

Code: https://github.com/DCDmllm/HyperLLaVA