VILA$^2$: VILA Augmented VILA
About
While visual language model architectures and training infrastructures advance rapidly, data curation remains under-explored, and data quantity and quality are becoming a bottleneck. Existing work either crawls additional Internet data with only loose quality guarantees or distills from black-box proprietary models such as GPT-4V and Gemini, which are bounded by API rate limits and the teacher's performance. This work enables a VLM to improve itself via data enhancement, exploiting its generative nature. We introduce a simple yet effective VLM augmentation scheme that includes a self-augment step and a specialist-augment step to iteratively improve data quality and hence model performance. In the self-augment step, the instruction-finetuned VLM recaptions its pretraining caption datasets, and the model is then retrained from scratch on the refined data. Without any expensive human-in-the-loop annotation, we observe data quality improvements and downstream accuracy boosts over three self-augmentation rounds, a viable free lunch for the current VLM training recipe. When self-augmentation saturates, we further augment caption diversity using specialty skills acquired during instruction finetuning. We finetune specialist VLMs from the self-augmented model on domain-specific data, including spatial reasoning, grounding, and OCR, and fuse their task-aware synthetic captions back into the pretraining stage. Data quality improvements and hallucination reductions are cross-checked by VLM judges (GPT-4V, Gemini) and human judges. Combining self-augmentation and specialist-augmented training, VILA$^2$ consistently improves accuracy on a wide range of benchmarks over the prior art, producing a reusable pretraining dataset that is 300x more cost-efficient than human labeling.
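The two augmentation steps can be summarized as a simple loop. The sketch below is a minimal illustration of that loop, not the authors' released code: all names (`Example`, `VLM`, `train_from_scratch`, `self_augment`, `specialist_augment`, and the specialist list) are hypothetical placeholders supplied by the caller.

```python
from typing import Callable, List, Tuple

# Hypothetical types: an (image, caption) pair, and a model that
# maps an image to a generated caption.
Example = Tuple[object, str]
VLM = Callable[[object], str]


def self_augment(
    pretrain_data: List[Example],
    train_from_scratch: Callable[[List[Example]], VLM],
    rounds: int = 3,  # the paper reports gains over three rounds
) -> Tuple[VLM, List[Example]]:
    """Self-augment step: train, recaption, retrain from scratch, repeat."""
    data = list(pretrain_data)
    vlm = train_from_scratch(data)
    for _ in range(rounds):
        # The instruction-finetuned VLM rewrites its own pretraining captions.
        data = [(img, vlm(img)) for img, _ in data]
        # A fresh model is then trained from scratch on the refined captions.
        vlm = train_from_scratch(data)
    return vlm, data


def specialist_augment(
    data: List[Example],
    specialists: List[VLM],  # e.g., spatial, grounding, and OCR experts
) -> List[Example]:
    """Specialist-augment step: fuse task-aware synthetic captions
    from domain specialists back into the pretraining corpus."""
    augmented = list(data)
    for expert in specialists:
        augmented += [(img, expert(img)) for img, _ in data]
    return augmented
```

Under these assumptions, the key design choice is that each self-augmentation round retrains from scratch on the recaptioned data rather than continuing training, and the specialist step only kicks in once the self-augment gains saturate.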
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 82.9 | 1165 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 50 | 418 |
| Multi-discipline Multimodal Understanding | MMMU | Accuracy | 38.3 | 266 |
| Multimodal Understanding | SEED-Bench | -- | -- | 203 |
| Multimodal Understanding | MMBench CN | Accuracy | 71.7 | 162 |
| Hallucination Evaluation | POPE | Accuracy | 86.7 | 132 |
| Multimodal Understanding | MMMU (val) | MMMU Score | 53 | 111 |
| Multimodal Understanding | MMMU (test) | MMMU Score | 47.9 | 86 |
| Science Question Answering | ScienceQA SQA-I | Accuracy | 87.6 | 81 |
| Multimodal Understanding | MMBench (MMB) | Accuracy | 76.6 | 69 |