
VILA$^2$: VILA Augmented VILA

About

While visual language model (VLM) architectures and training infrastructures advance rapidly, data curation remains under-explored, and data quantity and quality have become a bottleneck. Existing work either crawls extra Internet data with only loose quality guarantees or distills from black-box proprietary models, e.g., GPT-4V / Gemini, which are bounded by API rate limits and model performance. This work enables a VLM to improve itself via data enhancement, exploiting its generative nature. We introduce a simple yet effective VLM augmentation scheme that includes a self-augment step and a specialist-augment step to iteratively improve data quality and, hence, model performance. In the self-augment step, the instruction-finetuned VLM recaptions its pretraining caption datasets and is then retrained from scratch on the refined data. Without any expensive human-in-the-loop annotation, we observe data-quality improvements and downstream accuracy boosts over three self-augmentation rounds -- a viable free lunch for the current VLM training recipe. When self-augmentation saturates, we augment caption diversity by leveraging specialty skills picked up during instruction finetuning. We finetune specialists from the self-augmented VLM in specific domains, including spatial reasoning, grounding, and OCR, to fuse task-aware synthetic data into the pretraining stage. Data-quality improvements and hallucination reductions are cross-checked by VLM judges (GPT-4V, Gemini) and human judges. Combining self-augmented and specialist-augmented training, VILA$^2$ consistently improves accuracy over the prior art on a wide range of benchmarks, producing a reusable pretraining dataset that is 300x more cost-efficient than human labeling.
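The augmentation scheme in the abstract can be summarized as a two-phase loop: recaption-and-retrain for several self-augmentation rounds, then fuse specialist-generated captions into the pretraining data. The toy sketch below illustrates only that control flow; every function name and the string-based data representation are hypothetical placeholders, not the authors' actual implementation.

```python
# Toy sketch of the two-phase augmentation loop (hypothetical stand-ins,
# not the real VILA^2 training code).

def recaption(model, dataset):
    # Stand-in: the instruction-finetuned VLM rewrites each pretraining caption.
    return [f"{model} refined: {caption}" for caption in dataset]

def train_from_scratch(dataset, round_idx):
    # Stand-in for full VLM pretraining + instruction finetuning.
    return f"vlm_round_{round_idx}"

def self_augment(dataset, rounds=3):
    """Iteratively recaption the pretraining data and retrain from scratch
    (the abstract reports gains over three rounds before saturation)."""
    model = train_from_scratch(dataset, 0)
    for r in range(1, rounds + 1):
        dataset = recaption(model, dataset)
        model = train_from_scratch(dataset, r)
    return model, dataset

def specialist_augment(model, dataset, skills=("spatial", "grounding", "ocr")):
    """Finetune domain specialists from the self-augmented VLM and fuse their
    task-aware synthetic captions back into the pretraining data."""
    for skill in skills:
        specialist = f"{model}_{skill}_specialist"
        dataset = dataset + [f"{specialist}: synthetic {skill} caption"]
    return dataset

model, data = self_augment(["a photo of a dog"], rounds=3)
data = specialist_augment(model, data)
```

The key design point the sketch preserves is that each round retrains from scratch on the refined data rather than continuing training, and specialist data is injected only after self-augmentation saturates.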

Yunhao Fang, Ligeng Zhu, Yao Lu, Yan Wang, Pavlo Molchanov, Jan Kautz, Jang Hyun Cho, Marco Pavone, Song Han, Hongxu Yin • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy 82.9 | 1165 |
| Multimodal Understanding | MM-Vet | MM-Vet Score 50 | 418 |
| Multi-discipline Multimodal Understanding | MMMU | Accuracy 38.3 | 266 |
| Multimodal Understanding | SEED-Bench | -- | 203 |
| Multimodal Understanding | MMBench CN | Accuracy 71.7 | 162 |
| Hallucination Evaluation | POPE | Accuracy 86.7 | 132 |
| Multimodal Understanding | MMMU (val) | MMMU Score 53 | 111 |
| Multimodal Understanding | MMMU (test) | MMMU Score 47.9 | 86 |
| Science Question Answering | ScienceQA SQA-I | Accuracy 87.6 | 81 |
| Multimodal Understanding | MMBench (MMB) | Accuracy 76.6 | 69 |

Showing 10 of 14 rows

Other info

Code
