
LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model

About

In this paper, we introduce LLaVA-$\phi$ (LLaVA-Phi), an efficient multi-modal assistant that harnesses the power of the recently advanced small language model, Phi-2, to facilitate multi-modal dialogues. LLaVA-Phi marks a notable advancement in the realm of compact multi-modal models. It demonstrates that even smaller language models, with as few as 2.7B parameters, can effectively engage in intricate dialogues that integrate both textual and visual elements, provided they are trained with high-quality corpora. Our model delivers commendable performance on publicly available benchmarks that encompass visual comprehension, reasoning, and knowledge-based perception. Beyond its remarkable performance in multi-modal dialogue tasks, our model opens new avenues for applications in time-sensitive environments and systems that require real-time interaction, such as embodied agents. It highlights the potential of smaller language models to achieve sophisticated levels of understanding and interaction, while maintaining greater resource efficiency. The project is available at https://github.com/zhuyiche/llava-phi.

Yichen Zhu, Minjie Zhu, Ning Liu, Zhicai Ou, Xiaofeng Mou, Jian Tang • 2024
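The abstract pairs a vision encoder with the compact Phi-2 language model in the usual LLaVA fashion. The sketch below is a minimal, hypothetical illustration of the projector idea behind that design: a small MLP mapping vision-encoder patch features into the language model's embedding space so that visual tokens can be fed to the LM alongside text embeddings. The dimensions (1024-dim CLIP-style patch features, 2560-dim Phi-2 embeddings) and module names are assumptions for illustration, not code from the repository.

```python
# Hypothetical sketch of the LLaVA-style vision-to-language projector.
# Dimensions are assumed (CLIP-style 1024-dim patches, Phi-2-style 2560-dim embeddings).
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """Two-layer MLP mapping vision-encoder features into the LM embedding space."""

    def __init__(self, vision_dim: int = 1024, lm_dim: int = 2560):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim)
        return self.net(image_features)


if __name__ == "__main__":
    projector = VisionProjector()
    # Fake patch features: 1 image, 576 patches, 1024 dims each.
    fake_image_features = torch.randn(1, 576, 1024)
    visual_tokens = projector(fake_image_features)
    # These visual tokens would be concatenated with the text embeddings
    # before running the combined sequence through the language model.
    print(visual_tokens.shape)  # torch.Size([1, 576, 2560])
```

In this kind of design, the small language model (here, 2.7B-parameter Phi-2 rather than a 7B+ LLM) is what keeps inference cheap enough for the time-sensitive and embodied-agent settings the abstract mentions.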

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy: 71.4 | 1165 |
| Visual Question Answering | TextVQA | Accuracy: 48.6 | 1117 |
| Visual Question Answering | VizWiz | Accuracy: 35.9 | 1043 |
| Visual Question Answering | GQA | Accuracy: 56.5 | 963 |
| Object Hallucination Evaluation | POPE | Accuracy: 85 | 935 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy: 71.4 | 664 |
| Multimodal Evaluation | MME | Score: 1.34e+3 | 557 |
| Multimodal Understanding | MM-Vet | MM-Vet Score: 28.9 | 418 |
| Multimodal Understanding | MMBench | -- | 367 |
| Visual Question Answering | TextVQA (val) | VQA Score: 48.6 | 309 |

Showing 10 of 47 rows.

Other info

Code

https://github.com/zhuyiche/llava-phi
