Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models

About

Multimodal Large Language Models (MLLMs) have recently demonstrated impressive capabilities in connecting vision and language, yet their proficiency in fundamental visual reasoning tasks remains limited. This limitation can be attributed to the fact that MLLMs learn visual understanding primarily from textual descriptions, which constitute a subjective and inherently incomplete supervisory signal. Furthermore, the modest scale of multimodal instruction tuning compared to massive text-only pre-training leads MLLMs to overfit language priors while overlooking visual details. To address these issues, we introduce JARVIS, a JEPA-inspired framework for self-supervised visual enhancement in MLLMs. Specifically, we integrate the I-JEPA learning paradigm into the standard vision-language alignment pipeline of MLLMs training. Our approach leverages frozen vision foundation models as context and target encoders, while training the predictor, implemented as the early layers of an LLM, to learn structural and semantic regularities from images without relying exclusively on language supervision. Extensive experiments on standard MLLM benchmarks show that JARVIS consistently improves performance on vision-centric benchmarks across different LLM families, without degrading multimodal reasoning abilities. Our source code is publicly available at: https://github.com/aimagelab/JARVIS.

Davide Caffagni, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Pier Luigi Dovesi, Shaghayegh Roohi, Mark Granroth-Wilding, Rita Cucchiara• 2025

Related benchmarks

Task	Dataset	Result
Optical Character Recognition	OCRBench	--	433
Visual Reasoning	BLINK	Accuracy49.6	107
Multimodal Visual Perception	MMVP	Accuracy38	72
Real-world Question Answering	RealworldQA	--	58
3D Computer Vision Benchmarking	CVBench3D	Accuracy73	24
Visual Reasoning	Vision-Centric Benchmarks	BLINK Score50	20
Optical Character Recognition	OCR	Average Score37	20
2D Computer Vision Benchmarking	CVBench2D	Accuracy63.9	13
General Multimodal Understanding	General Benchmarks	Average Score74	12
Knowledge-based Visual Question Answering	Knowledge Benchmarks	Average Score48.2	12

Showing 10 of 13 rows

Other info

Follow for update

@wizwand_team Discord