
Generative Visual Instruction Tuning

About

We propose using automatically generated instruction-following data to improve the zero-shot capabilities of a large multimodal model, with additional support for generative and image editing tasks. We achieve this by curating a new multimodal instruction-following set using GPT-4V and existing datasets for image generation and editing. Combining this instruction set with the existing LLaVA-Finetune instruction set for visual understanding tasks, we produce GenLLaVA, a Generative Large Language and Visual Assistant. GenLLaVA is built through a strategy that combines three types of large pretrained models via instruction finetuning: Mistral for language modeling, SigLIP for image-text matching, and StableDiffusion for text-to-image generation. Our model demonstrates visual understanding capabilities superior to LLaVA and achieves results competitive with native multimodal models such as Unified-IO 2, paving the way for building advanced general-purpose visual assistants by effectively reusing existing multimodal models. We open-source our dataset, codebase, and model checkpoints to foster further research and application in this domain.
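The three-model composition described above (a vision encoder feeding a language model, whose output is routed either to text or to a diffusion decoder for image generation/editing) can be sketched schematically. Every class and method name below is an illustrative placeholder, not the released GenLLaVA API; the real system uses learned SigLIP, Mistral, and StableDiffusion weights rather than these toy stand-ins.

```python
# Schematic sketch of the GenLLaVA-style composition.
# All names are illustrative placeholders, not the actual codebase API.

class VisionEncoder:
    """Stand-in for SigLIP: maps an image to patch embeddings."""
    def encode(self, image):
        return [float(p) for p in image]

class LanguageModel:
    """Stand-in for Mistral: decodes either text or special image tokens."""
    def generate(self, prompt_tokens, image_embeds):
        # A real LM decodes autoregressively; here we only tag the output
        # so the router below can dispatch it to the right decoder.
        if "edit" in prompt_tokens or "draw" in prompt_tokens:
            return ("IMG", image_embeds)   # conditioning for the diffusion model
        return ("TXT", "a description of the image")

class DiffusionDecoder:
    """Stand-in for StableDiffusion: conditioning embeddings -> pixels."""
    def decode(self, conditioning):
        return [round(c) for c in conditioning]

class GenerativeAssistant:
    """Routes language-model output to plain text or the image decoder."""
    def __init__(self):
        self.vision = VisionEncoder()
        self.lm = LanguageModel()
        self.diffusion = DiffusionDecoder()

    def respond(self, prompt, image):
        embeds = self.vision.encode(image)
        kind, payload = self.lm.generate(prompt.split(), embeds)
        if kind == "IMG":
            return self.diffusion.decode(payload)  # an image (pixels)
        return payload                             # a text answer

assistant = GenerativeAssistant()
print(assistant.respond("describe this", [1, 2, 3]))  # text branch
print(assistant.respond("edit the sky", [1, 2, 3]))   # image branch
```

The key design point this sketch illustrates is that a single instruction-tuned language model serves as the router: understanding queries stay in the text branch, while generation/editing instructions emit tokens that condition the diffusion decoder.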

Jefferson Hernandez, Ruben Villegas, Vicente Ordonez • 2024

Related benchmarks

Task                      Dataset     Result          Rank
Multimodal Understanding  SEED-Bench  -               203
Multimodal Benchmarking   MMBench     Score 65        62
Multimodal Benchmarking   MMMU        Accuracy 29.7   15
