# LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

## About
LLaVA-Plus is a general-purpose multimodal assistant that expands the capabilities of large multimodal models. It maintains a skill repository of pre-trained vision and vision-language models and activates the relevant tools based on the user's input to fulfill real-world tasks. LLaVA-Plus is trained on multimodal instruction-following data to acquire its tool-use ability, covering visual understanding, generation, external knowledge retrieval, and their compositions. Empirical results show that LLaVA-Plus outperforms LLaVA on existing capabilities and exhibits new ones. Distinctively, the image query is directly grounded and actively engaged throughout the human-AI interaction session, significantly improving tool-use performance and enabling new scenarios.
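As a rough illustration of the dispatch cycle described above, here is a minimal, hypothetical Python sketch: the skill names, the JSON call format, and the scripted `query_llm` stub are assumptions made for this example, not the actual LLaVA-Plus prompt format or skill interface.

```python
# Minimal sketch of a LLaVA-Plus-style tool-dispatch loop. The tool names,
# the scripted query_llm stub, and the JSON call format are illustrative
# assumptions, not the actual LLaVA-Plus implementation.
import json
from typing import Callable, Dict

# Skill repository: maps a tool name to a pre-trained vision model
# (stubbed here as plain functions over an image path and arguments).
SKILLS: Dict[str, Callable[[str, str], str]] = {
    "detect": lambda image, args: f"boxes for '{args}' in {image}",
    "segment": lambda image, args: f"masks for '{args}' in {image}",
    "retrieve": lambda image, args: f"external facts about '{args}'",
}

def query_llm(image: str, history: list) -> str:
    """Scripted stand-in for the multimodal LLM: it emits one tool call,
    then a final answer. A real system would query the model here."""
    if not any(turn["role"] == "tool" for turn in history):
        return json.dumps({"tool": "detect", "arguments": "red car"})
    return "The red car is parked on the left side of the image."

def answer(image: str, request: str, max_steps: int = 3) -> str:
    """Alternate between the model and the skill repository until the
    model replies with plain text instead of a structured tool call."""
    history = [{"role": "user", "content": request}]
    reply = ""
    for _ in range(max_steps):
        reply = query_llm(image, history)
        try:
            call = json.loads(reply)  # structured output => tool call
        except json.JSONDecodeError:
            return reply              # plain text => final answer
        observation = SKILLS[call["tool"]](image, call["arguments"])
        # Feed the tool output back so the next turn stays grounded in
        # both the image and the newly gathered evidence.
        history.append({"role": "tool", "content": observation})
    return reply

print(answer("street.jpg", "Where is the red car?"))
```

The point the sketch isolates is that tool outputs are appended to the dialogue history, so each subsequent model turn is conditioned on both the original image and the accumulated evidence, which is what keeps the image query actively engaged across the session.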
## Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Multimodal Understanding | SEED-Bench | -- | 203 |
| Multimodal Understanding | MM-Vet (test) | -- | 114 |
| Multimodal Understanding | MM-Vet | Accuracy: 35 | 35 |
| Multimodal Instruction Following | VisIT-Bench (leaderboard, Sept. 27, 2023) | ELO: 1200 | 15 |
| Large Multimodal Model Evaluation | LLaVA-Bench Tool Use (test) | Grounding: 0.893 | 8 |
| Multimodal Tool Use | LLaVA-Bench Tool Use | Grounding: 89.3 | 8 |
| Large Multimodal Model Evaluation | LLaVA-Bench In-the-Wild v1 | Conversational Score: 65.5 | 6 |
| Large Multimodal Model Evaluation | LLaVA-Bench COCO v1 | Conversational Score: 0.816 | 6 |
| Image Captioning | COCO Caption | BLEU-1: 50.8 | 3 |