Boosting Visual Instruction Tuning with Self-Supervised Guidance

About

Multimodal large language models (MLLMs) perform well on many vision-language tasks but often struggle with vision-centric problems that require fine-grained visual reasoning. Recent evidence suggests that this limitation arises not from weak visual representations, but from under-utilization of visual information during instruction tuning, where many tasks can be partially solved using language priors alone. We propose a simple and lightweight approach that augments visual instruction tuning with a small number of visually grounded self-supervised tasks expressed as natural language instructions. By reformulating classical self-supervised pretext tasks, such as rotation prediction, color matching, and cross-view correspondence, as image-instruction-response triplets, we introduce supervision that cannot be solved without relying on visual evidence. Our approach requires no human annotations, no architectural modifications, and no additional training stages. Across multiple models, training regimes, and benchmarks, injecting only a small fraction (3-10%) of such visually grounded instructions consistently improves performance on vision-centric evaluations. Our findings highlight instruction tuning with visually grounded SSL tasks as a powerful lever for improving visual reasoning in MLLMs through simple adjustments to the training data distribution. Code available at: https://github.com/sirkosophia/V-GIFT
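As a rough illustration of what the abstract describes, the sketch below turns rotation prediction into an image-instruction-response triplet and mixes such triplets into an existing instruction-tuning set. It is not taken from the V-GIFT repository. The function names, the prompt wording, the nested-list image representation, and the reading of "fraction" as the share of the final mixture are all assumptions made for this example.

```python
import random

ANGLES = (0, 90, 180, 270)  # assumed label set for the rotation task


def rotate90_ccw(image):
    """Rotate a 2D grid (list of rows) by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*image)][::-1]


def rotate(image, angle):
    """Rotate a 2D grid counter-clockwise by a multiple of 90 degrees."""
    for _ in range((angle // 90) % 4):
        image = rotate90_ccw(image)
    return image


def make_rotation_triplet(image, rng):
    """Build one hypothetical image-instruction-response triplet.

    The label comes from the transformation itself, so no human
    annotation is needed, and the answer cannot be inferred from the
    instruction text alone.
    """
    angle = rng.choice(ANGLES)
    return {
        "image": rotate(image, angle),
        "instruction": (
            "By how many degrees counter-clockwise has this image been "
            "rotated? Answer with 0, 90, 180, or 270."
        ),
        "response": str(angle),
    }


def mix_in(base_samples, ssl_samples, fraction, rng):
    """Return a shuffled set in which about `fraction` of samples are SSL.

    Assumption: `fraction` is measured against the final mixture, so
    n_ssl / (n_base + n_ssl) = fraction.
    """
    n_ssl = round(fraction * len(base_samples) / (1.0 - fraction))
    n_ssl = min(n_ssl, len(ssl_samples))  # cannot draw more than we have
    mixed = list(base_samples) + rng.sample(list(ssl_samples), n_ssl)
    rng.shuffle(mixed)
    return mixed
```

In a real pipeline the image would be a decoded picture rather than a nested list, and `fraction` would fall in the 3-10% range reported above. The other pretext tasks mentioned (color matching, cross-view correspondence) would follow the same pattern with their own transformation and prompt.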

Sophia Sirko-Galouchenko, Monika Wysoczanska, Andrei Bursuc, Nicolas Thome, Spyros Gidaris • 2026

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy 88.9	2019
Optical Character Recognition	OCRBench	Score 634	433
Multimodal Understanding	MMStar	Accuracy 55.5	407
Mathematical Multimodal Reasoning	MathVista	Accuracy 22.6	258
Visual Perception	BLINK	Accuracy 52.2	241
Real-world Visual Question Answering	RealworldQA	Accuracy 66.4	173
Object Hallucination Evaluation	POPE (average across random and popular)	--	38
Vision-centric Evaluation	CV-Bench 2D	Score 63.8	15
Visual Grounding	CVB 2D	Accuracy 71	11
Multi-modal Reasoning	MMStar	MMStar Score 43.7	3
