MAGIC: Multimodal Alignment & Grounding-aware Instruction Coreset for Vision-Language Models

About

Instruction tuning of large vision-language models (LVLMs) increasingly depends on massive multimodal corpora, yet these datasets contain samples with substantial redundancy, low visual dependency, and highly imbalanced coverage of multimodal reasoning behaviors. As a result, uniform subsampling or naive score-based selection often yields suboptimal training subsets. We introduce MAGIC, a training-free, forward-only coreset selection method designed to construct compact yet behaviorally faithful subsets for multimodal instruction tuning. MAGIC is built on three intrinsic signals extracted from a pretrained VLM: Multimodal Gain, which measures the likelihood improvement obtained from visual input; Bridging Relevance, which captures the sharpness of answer-token grounding over visual tokens; and Skill-Neuron Signatures, which characterize the functional computation elicited by each sample via top-activated feed-forward neurons. MAGIC combines these signals in a three-stage pipeline: filtering low-gain examples, ranking candidates by a normalized quality objective, and performing bucket-wise budget allocation over discrete neuron signatures to preserve latent multimodal skill coverage. This formulation avoids backpropagation, auxiliary selector training, and expensive clustering in continuous activation spaces, while remaining efficient and easily deployable in existing VLMs. On the LLaVA-665K and Vision-Flan datasets, and in transfer settings to large target models (LLaVA-1.5-7B and -13B), MAGIC consistently improves over strong baselines under matched 20% budgets: it reaches 100.3% of full-finetuning performance on LLaVA-665K and 101.6% on Vision-Flan-186K, while reducing wall-clock run time by 73.7%.
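The three-stage pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the quality objective or the allocation rule, so the min-max normalization, the `alpha` weighting, the round-robin allocation across buckets, and all names (`neuron_signature`, `select_coreset`, the `gain`/`relevance`/`signature` fields) are assumptions made for illustration. The per-sample signals are taken as precomputed inputs.

```python
from collections import defaultdict


def neuron_signature(activations, k):
    """Discrete signature: sorted indices of the k most activated neurons.

    Hypothetical helper; the abstract only says signatures come from
    top-activated feed-forward neurons.
    """
    top = sorted(range(len(activations)), key=lambda i: activations[i], reverse=True)[:k]
    return tuple(sorted(top))


def _minmax(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [0.0 if span == 0 else (v - lo) / span for v in values]


def select_coreset(samples, budget, gain_threshold=0.0, alpha=0.5):
    """Return ids of selected samples.

    samples: list of dicts with keys 'id', 'gain', 'relevance', 'signature'.
    The scoring and allocation rules below are illustrative assumptions.
    """
    # Stage 1: drop examples whose multimodal gain is at or below the threshold.
    kept = [s for s in samples if s["gain"] > gain_threshold]
    if not kept or budget <= 0:
        return []

    # Stage 2: rank by a normalized quality score (assumed: weighted sum).
    gains = _minmax([s["gain"] for s in kept])
    rels = _minmax([s["relevance"] for s in kept])
    scored = [(alpha * g + (1 - alpha) * r, s) for g, r, s in zip(gains, rels, kept)]

    # Stage 3: bucket by discrete signature, then spend the budget
    # round-robin so every bucket is represented before any is exhausted.
    buckets = defaultdict(list)
    for score, s in scored:
        buckets[s["signature"]].append((score, s))
    for items in buckets.values():
        items.sort(key=lambda t: t[0], reverse=True)
    ordered = sorted(buckets.values(), key=lambda items: items[0][0], reverse=True)

    selected, rank = [], 0
    while len(selected) < budget:
        progressed = False
        for items in ordered:
            if rank < len(items) and len(selected) < budget:
                selected.append(items[rank][1]["id"])
                progressed = True
        if not progressed:
            break  # all buckets exhausted before the budget was filled
        rank += 1
    return selected


samples = [
    {"id": "a", "gain": 0.9, "relevance": 0.8, "signature": (1, 2)},
    {"id": "b", "gain": 0.8, "relevance": 0.9, "signature": (1, 2)},
    {"id": "c", "gain": 0.5, "relevance": 0.2, "signature": (3, 4)},
    {"id": "d", "gain": -0.1, "relevance": 0.9, "signature": (5, 6)},
]
picked = select_coreset(samples, budget=2)
```

In this toy input, a pure top-2 ranking would take both samples from the `(1, 2)` bucket; the bucket-wise step instead keeps one sample from each surviving signature, which is the coverage-preserving behavior the abstract describes.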

Shristi Das Biswas, Kaushik Roy • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Object Hallucination Evaluation | POPE | -- | 2019 |
| Visual Question Answering | TextVQA | Accuracy 58.8 | 1453 |
| Visual Question Answering | VQA v2 | -- | 1429 |
| Diagram Question Answering | AI2D | -- | 387 |
| Text-based Visual Question Answering | TextVQA (val) | Accuracy 56.4 | 276 |
| Visual Question Answering | GQA (test-dev) | Accuracy 60.4 | 236 |
| Multimodal Evaluation | MM-Vet | Score 36.3 | 196 |
| Visual Question Answering | VQAv2 | Accuracy 78.9 | 196 |
| Real-world Visual Question Answering | RealworldQA | -- | 173 |
| Multi-modal Evaluation | MME | MME Score 1.35e+3 | 160 |
Showing 10 of 35 rows
