
Uncovering Intrinsic Capabilities: A Paradigm for Data Curation in Vision-Language Models

About

Large vision-language models (VLMs) achieve strong benchmark performance, but controlling their behavior through instruction tuning remains difficult. Shrinking the instruction-tuning dataset often causes regressions, because heuristic selection strategies treat models as black boxes and overlook the latent capabilities that govern learning. We introduce Capability-Attributed Data Curation (CADC), a framework that shifts curation from task-specific heuristics to intrinsic capability analysis. CADC discovers intrinsic capabilities in an unsupervised manner from gradient-based learning trajectories, attributes training data to these capabilities via influence estimation, and curates capability-aware curricula through balanced selection and staged sequencing. This transforms black-box instruction tuning into a controllable, capability-driven process. With as little as 5% of the original data, CADC surpasses full-data training on multimodal benchmarks. These results validate intrinsic capabilities as the fundamental building blocks of model learning and establish CADC as a principled paradigm for instruction data curation.
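
The abstract does not spell out how balanced selection works, so the following is only a minimal sketch of one way such a step could look, not the authors' implementation. It assumes a per-sample influence score for each discovered capability is already available (the `influence` matrix, the `balanced_select` name, and the round-robin rule are all hypothetical), attributes each sample to its highest-scoring capability, and then draws samples evenly across capabilities until a data budget such as 5% is reached.

```python
from collections import defaultdict


def balanced_select(influence, budget_fraction=0.05):
    """Pick a capability-balanced subset of sample indices.

    influence[i][c] is an ASSUMED, precomputed influence score of
    sample i on capability c; how CADC computes it is not shown here.
    """
    n = len(influence)
    budget = max(1, int(n * budget_fraction))

    # Attribute each sample to the capability it influences most.
    groups = defaultdict(list)
    for i, scores in enumerate(influence):
        c = max(range(len(scores)), key=scores.__getitem__)
        groups[c].append((scores[c], i))

    # Within each capability, rank samples by descending influence.
    for members in groups.values():
        members.sort(key=lambda t: (-t[0], t[1]))

    # Round-robin over capabilities so no single one dominates the budget.
    selected, rank = [], 0
    while len(selected) < budget:
        progressed = False
        for c in sorted(groups):
            if rank < len(groups[c]) and len(selected) < budget:
                selected.append(groups[c][rank][1])
                progressed = True
        if not progressed:
            break
        rank += 1
    return selected


if __name__ == "__main__":
    # Toy scores for six samples and two capabilities (made-up numbers).
    toy = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.7],
           [0.2, 0.95], [0.6, 0.3], [0.3, 0.4]]
    print(balanced_select(toy, budget_fraction=0.5))
```

Round-robin is used here purely for simplicity; a staged curriculum, as the abstract mentions, would additionally need an ordering over capabilities, which this sketch does not attempt.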

Junjie Li, Ziao Wang, Jianghong Ma, Xiaofeng Zhang • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Multimodal Evaluation | MME | MME-P Score: 1.50e+3 | 114 |
| Multimodal Benchmarking | MMBench | Score: 73.1 | 73 |
| Mathematical Reasoning | MathVista | MathVista: 54 | 55 |
| Science Question Answering | SQA | SQA Score: 84.4 | 26 |
| Multimodal Reasoning | Multiple Evaluation Benchmarks Aggregate (test) | Relative Average Performance: 101.3 | 24 |
| Hallucination Detection | HallusionBench | Hallusion Score: 44.6 | 20 |
| Multi-task and Multi-image Reasoning | MMT-Bench | SI Score: 57.5 | 11 |
| Mathematical Vision Reasoning | MathVision | Score (MINI): 16.1 | 11 |
