# LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

## About
LLaVA-Plus is a general-purpose multimodal assistant that expands the capabilities of large multimodal models. It maintains a skill repository of pre-trained vision and vision-language models and activates the relevant tools based on the user's input to fulfill real-world tasks. LLaVA-Plus is trained on multimodal instruction-following data to acquire its tool-use ability, covering visual understanding, generation, external knowledge retrieval, and their compositions. Empirical results show that LLaVA-Plus outperforms LLaVA on existing capabilities and exhibits new ones. Distinctively, the image query is directly grounded and actively engaged throughout the human-AI interaction session, significantly improving tool-use performance and enabling new scenarios.
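As a rough illustration of the dispatch cycle described above, here is a minimal, hypothetical Python sketch: the skill names, the JSON call format, and the scripted `query_llm` stub are assumptions made for this example, not the actual LLaVA-Plus prompt format or skill interface.

```python
# Minimal sketch of a LLaVA-Plus-style tool-dispatch loop. The tool names,
# the scripted query_llm stub, and the JSON call format are illustrative
# assumptions, not the actual LLaVA-Plus implementation.
import json
from typing import Callable, Dict

# Skill repository: maps a tool name to a pre-trained vision model
# (stubbed here as plain functions over an image path and arguments).
SKILLS: Dict[str, Callable[[str, str], str]] = {
    "detect": lambda image, args: f"boxes for '{args}' in {image}",
    "segment": lambda image, args: f"masks for '{args}' in {image}",
    "retrieve": lambda image, args: f"external facts about '{args}'",
}

def query_llm(image: str, history: list) -> str:
    """Scripted stand-in for the multimodal LLM: it emits one tool call,
    then a final answer. A real system would query the model here."""
    if not any(turn["role"] == "tool" for turn in history):
        return json.dumps({"tool": "detect", "arguments": "red car"})
    return "The red car is parked on the left side of the image."

def answer(image: str, request: str, max_steps: int = 3) -> str:
    """Alternate between the model and the skill repository until the
    model replies with plain text instead of a structured tool call."""
    history = [{"role": "user", "content": request}]
    reply = ""
    for _ in range(max_steps):
        reply = query_llm(image, history)
        try:
            call = json.loads(reply)  # structured output => tool call
        except json.JSONDecodeError:
            return reply              # plain text => final answer
        observation = SKILLS[call["tool"]](image, call["arguments"])
        # Feed the tool output back so the next turn stays grounded in
        # both the image and the newly gathered evidence.
        history.append({"role": "tool", "content": observation})
    return reply

print(answer("street.jpg", "Where is the red car?"))
```

The point the sketch isolates is that tool outputs are appended to the dialogue history, so each subsequent model turn is conditioned on both the original image and the accumulated evidence, which is what keeps the image query actively engaged across the session.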
## Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Multimodal Understanding | SEED-Bench | -- | 203 |
| Multimodal Understanding | MM-Vet (test) | -- | 114 |
| Multimodal Understanding | MM-Vet | Accuracy: 35 | 35 |
| Multimodal Instruction Following | VisIT-Bench (leaderboard, Sept. 27, 2023) | ELO: 1200 | 15 |
| Large Multimodal Model Evaluation | LLaVA-Bench Tool Use (test) | Grounding: 0.893 | 8 |
| Multimodal Tool Use | LLaVA-Bench Tool Use | Grounding: 89.3 | 8 |
| Large Multimodal Model Evaluation | LLaVA-Bench In-the-Wild v1 | Conversational Score: 65.5 | 6 |
| Large Multimodal Model Evaluation | LLaVA-Bench COCO v1 | Conversational Score: 0.816 | 6 |
| Image Captioning | COCO Caption | BLEU-1: 50.8 | 3 |