MAGIC: Multimodal Alignment & Grounding-aware Instruction Coreset for Vision-Language Models

About

Instruction tuning of large vision-language models (LVLMs) increasingly depends on massive multimodal corpora, yet these datasets contain substantial redundancy, many samples with low visual dependency, and highly imbalanced coverage of multimodal reasoning behaviors. As a result, uniform subsampling or naive score-based selection often yields suboptimal training subsets. We introduce MAGIC, a training-free, forward-only coreset selection method that constructs compact yet behaviorally faithful subsets for multimodal instruction tuning. MAGIC is built on three intrinsic signals extracted from a pretrained VLM: Multimodal Gain, which measures the likelihood improvement obtained from visual input; Bridging Relevance, which captures the sharpness of answer-token grounding over visual tokens; and Skill-Neuron Signatures, which characterize the functional computation elicited by each sample via top-activated feed-forward neurons. MAGIC combines these signals in a three-stage pipeline: (1) filtering low-gain examples, (2) ranking candidates by a normalized quality objective, and (3) performing bucket-wise budget allocation over discrete neuron signatures to preserve latent multimodal skill coverage. This formulation avoids backpropagation, auxiliary selector training, and expensive clustering in continuous activation spaces, while remaining efficient and easy to deploy with existing VLMs. On the LLaVA-665K and Vision-Flan datasets, and in transfer settings to large target models (LLaVA-1.5-7B and -13B), MAGIC consistently improves over strong baselines under matched 20% budgets: it reaches 100.3% of full-finetuning performance on LLaVA-665K and 101.6% on Vision-Flan-186K, while reducing wall-clock run time by 73.7%.
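The three-stage pipeline described in the abstract can be sketched roughly as follows. This is an illustrative reading only, not the authors' implementation: the abstract does not give the exact formulas, so the log-likelihood difference used for Multimodal Gain, the z-score sum used as the "normalized quality objective", the proportional quota rule, and all function and field names (`multimodal_gain`, `neuron_signature`, `select_coreset`, `gain_threshold`, the sample dict keys) are assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def multimodal_gain(logp_with_image, logp_text_only):
    # Assumed form: improvement in answer log-likelihood when the image is given.
    return logp_with_image - logp_text_only


def neuron_signature(activations, k):
    # Assumed form: the sorted indices of the k most activated neurons,
    # used as a discrete, hashable bucket key.
    top = sorted(range(len(activations)), key=lambda i: activations[i], reverse=True)[:k]
    return tuple(sorted(top))


def zscore(values):
    # Standardize a list of scores; a constant list maps to all zeros.
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def select_coreset(samples, budget, gain_threshold=0.0):
    """Return the ids of selected samples.

    Each sample is a dict with hypothetical keys:
    'id', 'gain', 'relevance', 'signature' (a hashable, sortable key).
    """
    # Stage 1: drop examples whose gain does not exceed the threshold.
    kept = [s for s in samples if s["gain"] > gain_threshold]
    if not kept or budget <= 0:
        return []
    budget = min(budget, len(kept))

    # Stage 2: assumed quality score = z(gain) + z(relevance).
    zg = zscore([s["gain"] for s in kept])
    zr = zscore([s["relevance"] for s in kept])

    # Stage 3: group by signature and rank within each bucket by quality.
    buckets = defaultdict(list)
    for g, r, s in zip(zg, zr, kept):
        buckets[s["signature"]].append((g + r, s["id"]))
    for items in buckets.values():
        items.sort(key=lambda t: t[0], reverse=True)

    # Proportional quota per bucket (floored), capped at bucket size.
    total = len(kept)
    quota = {sig: min(len(items), budget * len(items) // total)
             for sig, items in buckets.items()}

    # Hand leftover slots to the buckets with the smallest quota first,
    # so that rare signatures are still represented.
    remaining = budget - sum(quota.values())
    order = sorted(buckets, key=lambda sig: (quota[sig], sig))
    while remaining > 0:
        for sig in order:
            if remaining == 0:
                break
            if quota[sig] < len(buckets[sig]):
                quota[sig] += 1
                remaining -= 1

    selected = []
    for sig, items in buckets.items():
        selected.extend(sample_id for _, sample_id in items[:quota[sig]])
    return selected
```

Under these assumptions, a matched 20% budget would correspond to calling `select_coreset(samples, budget=int(0.2 * len(samples)))`. Bucketing on discrete signatures is what lets the allocation step avoid clustering in a continuous activation space: grouping is a dictionary lookup rather than a distance computation.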

Shristi Das Biswas, Kaushik Roy • 2026

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	--	2056
Visual Question Answering	TextVQA	Accuracy 58.8	1455
Visual Question Answering	VQA v2	--	1429
Diagram Question Answering	AI2D	--	509
Text-based Visual Question Answering	TextVQA (val)	Accuracy 56.4	276
Multimodal Evaluation	MM-Vet	Score 36.3	249
Multi-modal Evaluation	MME	MME Score 1.35e+3	240
Visual Question Answering	GQA (test-dev)	Accuracy 60.4	236
Visual Question Answering	VQAv2	Accuracy 78.9	226
Real-world Visual Question Answering	RealworldQA	--	183

Showing 10 of 35 rows
