
HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models

About

Recent advances indicate that scaling up Multimodal Large Language Models (MLLMs) effectively enhances performance on downstream multimodal tasks. The prevailing MLLM paradigm, e.g., LLaVA, transforms visual features into text-like tokens with a static vision-language mapper, enabling a static LLM to comprehend visual information through visual instruction tuning. ("Static" tuning here means the trained model's parameters are fixed and shared across all inputs.) Although promising, this static strategy may constrain performance across different downstream multimodal tasks. In light of this, we introduce HyperLLaVA, which adaptively tunes the projector and LLM parameters in conjunction with a dynamic visual expert and a dynamic language expert, respectively. These experts are derived from HyperNetworks, which generate adaptive parameter shifts from visual and language guidance, enabling dynamic projector and LLM modeling across the two-stage training. Our experiments demonstrate that our solution significantly surpasses LLaVA on existing MLLM benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench. Our project is available at https://github.com/DCDmllm/HyperLLaVA.
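To make the mechanism concrete, the sketch below shows the core idea of a hypernetwork-driven dynamic projector: a small network maps pooled visual guidance to a parameter shift that is added to a static vision-language mapper for that input. All names, dimensions, and the pooling/architecture choices here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Small dimensions for illustration (hypothetical, not the paper's sizes)
VIS, TXT, HID = 32, 16, 64

# Static projector weights (the LLaVA-style vision-language mapper)
W_base = rng.normal(scale=0.02, size=(TXT, VIS))

# Hypernetwork weights: guidance vector -> flattened weight shift
W_h1 = rng.normal(scale=0.02, size=(HID, VIS))
W_h2 = rng.normal(scale=0.02, size=(TXT * VIS, HID))

def dynamic_project(visual_tokens):
    """Map visual tokens (N, VIS) to text-like tokens (N, TXT).

    A pooled visual guidance vector conditions the hypernetwork,
    which emits a weight shift added to the static projector,
    so the effective projection differs per input."""
    guidance = visual_tokens.mean(axis=0)                 # (VIS,) visual guidance
    delta = (W_h2 @ relu(W_h1 @ guidance)).reshape(TXT, VIS)
    return visual_tokens @ (W_base + delta).T             # (N, TXT)

tokens = rng.normal(size=(5, VIS))
print(dynamic_project(tokens).shape)  # (5, 16)
```

The same pattern would apply to the language expert, with language-side guidance producing shifts for (a subset of) LLM parameters rather than the projector.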

Wenqiao Zhang, Tianwei Lin, Jiang Liu, Fangxun Shu, Haoyuan Li, Lei Zhang, He Wanggui, Hao Zhou, Zheqi Lv, Hao Jiang, Juncheng Li, Siliang Tang, Yueting Zhuang · 2024

Related benchmarks

Task                             Dataset         Metric         Result   Rank
Visual Question Answering        VizWiz          Accuracy       53.4     1525
Object Hallucination Evaluation  POPE            Accuracy       86.3     1455
Visual Question Answering        VQA v2          Accuracy       79.1     1362
Visual Question Answering        TextVQA         Accuracy       58.5     1285
Visual Question Answering        GQA             Accuracy       62.7     1249
Multimodal Evaluation            MME             Score          1490     658
Multimodal Understanding         MM-Vet          MM-Vet Score   52.1     531
Multimodal Reasoning             MM-Vet          MM-Vet Score   31       431
Multimodal Understanding         SEED-Bench      Accuracy       61.4     343
Science Question Answering       ScienceQA IMG   Accuracy       70.4     294

(Showing 10 of 20 rows.)

Other info

Code: https://github.com/DCDmllm/HyperLLaVA