
Visual Compositional Tuning

About

Visual instruction tuning (VIT) datasets have grown rapidly in scale, yet the informativeness of individual training samples has largely been overlooked. Recent dataset selection methods have shown that a small fraction of such datasets, enriched with informative samples, can lead to efficient finetuning of Multimodal Large Language Models. In this work, we explore the impact of sample complexity on informative data curation and introduce COMPACT (COMPositional Atomic-to-complex Visual Compositional Tuning), a compositional VIT data recipe that scales training sample complexity by combining multiple atomic visual capabilities in a single training example. Concretely, we synthesize rich and informative text questions for each image, allowing us to significantly reduce the number of training examples required for effective VIT. COMPACT demonstrates superior data efficiency compared to existing data reduction methods. When applied to the LLaVA-665K VIT dataset, COMPACT reduces the data budget by 90% while still achieving 100.2% of the full VIT performance (compared to only 97.5% by the state-of-the-art method) across eight multimodal benchmarks. Furthermore, training on the COMPACT data outperforms training on the full-scale VIT data on particularly complex benchmarks such as MM-Vet (+8.6%) and MMStar (+2.9%). COMPACT offers a scalable and efficient synthetic data generation recipe for improving performance on vision-language tasks.
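The two quantitative ideas in the abstract, combining several atomic capabilities in one question and cutting the data budget by 90%, can be illustrated with a minimal sketch. This is not the authors' pipeline: the capability names, the helper functions, and the choice of k = 2 are hypothetical, and the dataset size of 665,000 simply takes the name LLaVA-665K at face value.

```python
import itertools

# Hypothetical atomic capabilities; the abstract does not enumerate them.
ATOMIC_CAPABILITIES = ["color", "counting", "spatial relation", "object recognition"]


def capability_combinations(capabilities, k):
    """Return every unordered set of k capabilities one question could combine."""
    return [tuple(c) for c in itertools.combinations(capabilities, k)]


def reduced_budget(full_size, reduction):
    """Number of samples left after cutting the data budget by `reduction` (a fraction)."""
    return round(full_size * (1 - reduction))


# Assumption: each composed question exercises k = 2 capabilities.
combos = capability_combinations(ATOMIC_CAPABILITIES, 2)

# Assumption: 665,000 samples in the full set, reduced by 90% as the abstract states.
budget = reduced_budget(665_000, 0.90)

print(len(combos), "capability pairs;", budget, "training samples in the reduced budget")
```

Under these assumptions, the reduced budget is roughly a tenth of the full set, and the number of possible capability combinations grows quickly as more atomic capabilities or a larger k are used.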

Xindi Wu, Hee Seung Hwang, Polina Kirichenko, Esin Tureci, Olga Russakovsky • 2025

Related benchmarks

Task                                       Dataset                      Result          Rank
Multimodal Capability Evaluation           MM-Vet                       Score 31.74     393
Multi-discipline Multimodal Understanding  MMMU                         Accuracy 33.89  363
Visual Question Answering                  OK-VQA                       Accuracy 50.02  272
Visual Question Answering                  InfoVQA                      Accuracy 23.68  195
Multimodal Understanding                   SEEDBench2 Plus              Accuracy 43.13  138
Multi-discipline Multimodal Understanding  MMMU-Pro                     --              66
Visual Reasoning                           MMStar                       Accuracy 36.13  51
Computer Vision Benchmarking               CVBench                      Accuracy 55.28  16
Visual Perception and Cognition            MME                          Score 1.38e+3   10
Open-Ended Visual Question Answering       LLaVA-in-the-Wild (LLaVA-W)  Score 64.5      10
