Uncovering Intrinsic Capabilities: A Paradigm for Data Curation in Vision-Language Models

About

Large vision-language models (VLMs) achieve strong benchmark performance, but controlling their behavior through instruction tuning remains difficult. Reducing the instruction-tuning data budget often causes regressions, as heuristic strategies treat models as black boxes and overlook the latent capabilities that govern learning. We introduce Capability-Attributed Data Curation (CADC), a framework that shifts curation from task-specific heuristics to intrinsic capability analysis. CADC discovers intrinsic capabilities in an unsupervised manner from gradient-based learning trajectories, attributes training data to these capabilities via influence estimation, and curates capability-aware curricula through balanced selection and staged sequencing. This transforms black-box instruction tuning into a controllable, capability-driven process. With as little as 5% of the original data, CADC surpasses full-data training on multimodal benchmarks. These results validate intrinsic capabilities as the fundamental building blocks of model learning and establish CADC as a principled paradigm for instruction data curation.

Junjie Li, Ziao Wang, Jianghong Ma, Xiaofeng Zhang • 2025
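
The abstract's "balanced selection and staged sequencing" step can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a precomputed, hypothetical influence matrix `scores[i][c]` (influence of sample `i` on capability `c`), attributes each sample to its highest-scoring capability, splits the budget equally across capabilities, and emits one training stage per capability in index order. The function name `balanced_select`, the argmax attribution, the equal split, and the stage order are all assumptions made for illustration; the listing does not specify how CADC performs any of these steps.

```python
from collections import defaultdict


def balanced_select(scores, budget_fraction):
    """Pick a capability-balanced subset and group it into stages.

    scores[i][c] is a hypothetical, precomputed influence of sample i
    on capability c. Returns a list of stages, one per capability,
    each a list of sample indices sorted by descending influence.
    """
    n_samples = len(scores)
    n_caps = len(scores[0])
    budget = max(1, int(n_samples * budget_fraction))
    per_cap = max(1, budget // n_caps)  # equal share per capability (assumption)

    # Attribute each sample to the capability it influences most (assumption).
    groups = defaultdict(list)
    for i, row in enumerate(scores):
        cap = max(range(n_caps), key=lambda c: row[c])
        groups[cap].append(i)

    # Within each capability, keep the highest-influence samples.
    stages = []
    for cap in range(n_caps):
        ranked = sorted(groups[cap], key=lambda i: scores[i][cap], reverse=True)
        stages.append(ranked[:per_cap])
    return stages


if __name__ == "__main__":
    # Toy data: even-indexed samples influence capability 0, odd ones capability 1.
    toy = [[float(i), 0.0] if i % 2 == 0 else [0.0, float(i)] for i in range(20)]
    for stage_id, stage in enumerate(balanced_select(toy, 0.25)):
        print(f"stage {stage_id}: samples {stage}")
```

The equal per-capability split is the simplest way to keep a small budget from collapsing onto a single dominant capability; a weighted split would be a natural variation.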

Related benchmarks

Task	Dataset	Metric	Result	
Multimodal Evaluation	MME	MME-P Score	1.50e+3	139
Multimodal Benchmarking	MMBench	Score	73.1	73
Mathematical Reasoning	MathVista	MathVista	54	55
Science Question Answering	SQA	SQA Score	84.4	36
Multimodal Reasoning	Multiple Evaluation Benchmarks Aggregate (test)	Relative Average Performance	101.3	24
Hallucination Detection	HallusionBench	Hallusion Score	44.6	20
Multi-task and Multi-image Reasoning	MMT-Bench	SI Score	57.5	11
Mathematical Vision Reasoning	MathVision	Score (MINI)	16.1	11
