VILA$^2$: VILA Augmented VILA
About
While visual language model architectures and training infrastructures advance rapidly, data curation remains under-explored, and data quantity and quality are becoming a bottleneck. Existing work either crawls additional Internet data with only loose quality guarantees or distills from black-box proprietary models such as GPT-4V and Gemini, which are bounded by API rate limits and the teacher's performance. This work enables a VLM to improve itself via data enhancement, exploiting its generative nature. We introduce a simple yet effective VLM augmentation scheme that includes a self-augment step and a specialist-augment step to iteratively improve data quality and hence model performance. In the self-augment step, the instruction-finetuned VLM recaptions its pretraining caption datasets, and the model is then retrained from scratch on the refined data. Without any expensive human-in-the-loop annotation, we observe data quality improvements and downstream accuracy boosts over three self-augmentation rounds, a viable free lunch for the current VLM training recipe. When self-augmentation saturates, we further augment caption diversity using specialty skills acquired during instruction finetuning. We finetune specialist VLMs from the self-augmented model on domain-specific data, including spatial reasoning, grounding, and OCR, and fuse their task-aware synthetic captions back into the pretraining stage. Data quality improvements and hallucination reductions are cross-checked by VLM judges (GPT-4V, Gemini) and human judges. Combining self-augmentation and specialist-augmented training, VILA$^2$ consistently improves accuracy on a wide range of benchmarks over the prior art, producing a reusable pretraining dataset that is 300x more cost-efficient than human labeling.
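The two augmentation steps can be summarized as a simple loop. The sketch below is a minimal illustration of that loop, not the authors' released code: all names (`Example`, `VLM`, `train_from_scratch`, `self_augment`, `specialist_augment`, and the specialist list) are hypothetical placeholders supplied by the caller.

```python
from typing import Callable, List, Tuple

# Hypothetical types: an (image, caption) pair, and a model that
# maps an image to a generated caption.
Example = Tuple[object, str]
VLM = Callable[[object], str]


def self_augment(
    pretrain_data: List[Example],
    train_from_scratch: Callable[[List[Example]], VLM],
    rounds: int = 3,  # the paper reports gains over three rounds
) -> Tuple[VLM, List[Example]]:
    """Self-augment step: train, recaption, retrain from scratch, repeat."""
    data = list(pretrain_data)
    vlm = train_from_scratch(data)
    for _ in range(rounds):
        # The instruction-finetuned VLM rewrites its own pretraining captions.
        data = [(img, vlm(img)) for img, _ in data]
        # A fresh model is then trained from scratch on the refined captions.
        vlm = train_from_scratch(data)
    return vlm, data


def specialist_augment(
    data: List[Example],
    specialists: List[VLM],  # e.g., spatial, grounding, and OCR experts
) -> List[Example]:
    """Specialist-augment step: fuse task-aware synthetic captions
    from domain specialists back into the pretraining corpus."""
    augmented = list(data)
    for expert in specialists:
        augmented += [(img, expert(img)) for img, _ in data]
    return augmented
```

Under these assumptions, the key design choice is that each self-augmentation round retrains from scratch on the recaptioned data rather than continuing training, and the specialist step only kicks in once the self-augment gains saturate.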
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 82.9 | 1165 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 50 | 418 |
| Multi-discipline Multimodal Understanding | MMMU | Accuracy | 38.3 | 266 |
| Multimodal Understanding | SEED-Bench | -- | -- | 203 |
| Multimodal Understanding | MMBench CN | Accuracy | 71.7 | 162 |
| Hallucination Evaluation | POPE | Accuracy | 86.7 | 132 |
| Multimodal Understanding | MMMU (val) | MMMU Score | 53 | 111 |
| Multimodal Understanding | MMMU (test) | MMMU Score | 47.9 | 86 |
| Science Question Answering | ScienceQA SQA-I | Accuracy | 87.6 | 81 |
| Multimodal Understanding | MMBench (MMB) | Accuracy | 76.6 | 69 |