Generative Visual Instruction Tuning

About

We propose to use automatically generated instruction-following data to improve the zero-shot capabilities of a large multimodal model with additional support for generative and image editing tasks. We achieve this by curating a new multimodal instruction-following set using GPT-4V and existing datasets for image generation and editing. Using this instruction set and the existing LLaVA-Finetune instruction set for visual understanding tasks, we produce GenLLaVA, a Generative Large Language and Visual Assistant. GenLLaVA is built through a strategy that combines three types of large pretrained models through instruction finetuning: Mistral for language modeling, SigLIP for image-text matching, and StableDiffusion for text-to-image generation. Our model demonstrates visual understanding capabilities superior to LLaVA and additionally demonstrates competitive results with native multimodal models such as Unified-IO 2, paving the way for building advanced general-purpose visual assistants by effectively re-using existing multimodal models. We open-source our dataset, codebase, and model checkpoints to foster further research and application in this domain.
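The abstract describes training on a combination of two instruction sets: the existing LLaVA-Finetune set for visual understanding and a newly curated set for image generation and editing. The following is a minimal sketch of how such a mixture could be assembled. The record schema, field names, file names, and helper functions are all hypothetical illustrations; they are not taken from the released GenLLaVA dataset or codebase.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class InstructionExample:
    """One instruction-following record (hypothetical schema, not the released format)."""
    task: str                     # "understanding", "generation", or "editing"
    instruction: str              # the user prompt
    input_image: Optional[str]    # path to a conditioning image, if any
    target_text: Optional[str]    # expected text answer, if any
    target_image: Optional[str]   # path to the image to be produced, if any


def build_mixture(understanding: List[InstructionExample],
                  generative: List[InstructionExample],
                  seed: int = 0) -> List[InstructionExample]:
    """Concatenate both instruction sets and shuffle them reproducibly."""
    mixture = list(understanding) + list(generative)
    random.Random(seed).shuffle(mixture)
    return mixture


def task_counts(examples: List[InstructionExample]) -> Dict[str, int]:
    """Count how many examples belong to each task type."""
    counts: Dict[str, int] = {}
    for ex in examples:
        counts[ex.task] = counts.get(ex.task, 0) + 1
    return counts


# Toy stand-ins for the two instruction sets (invented examples).
understanding_set = [
    InstructionExample("understanding", "What is in the image?",
                       "img_001.jpg", "A dog on a beach.", None),
]
generative_set = [
    InstructionExample("generation", "Draw a red bicycle.",
                       None, None, "gen_001.png"),
    InstructionExample("editing", "Make the sky purple.",
                       "img_002.jpg", None, "edit_002.png"),
]

mixture = build_mixture(understanding_set, generative_set, seed=0)
print(task_counts(mixture))
```

Tagging each record with its task type is one simple way to let a single model decide whether a sample expects a text answer or an image output; the actual mechanism used by GenLLaVA may differ.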

Jefferson Hernandez, Ruben Villegas, Vicente Ordonez • 2024

Related benchmarks

Task	Dataset	Result
Multimodal Understanding	SEED-Bench	--	516
Multimodal Benchmarking	MMBench	Score 65	73
Multimodal Benchmarking	MMMU	Accuracy 29.7	15

