Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens

About

Vision-language models (VLMs) excel at multimodal understanding, yet their text-only decoding forces them to verbalize visual reasoning, limiting performance on tasks that demand visual imagination. Recent attempts train VLMs to render explicit images, but the heavy image-generation pre-training often hinders their reasoning ability. Inspired by the way humans reason with mental imagery (the internal construction and manipulation of visual cues), we investigate whether VLMs can reason through interleaved multimodal trajectories without producing explicit images. To this end, we present a Machine Mental Imagery framework, dubbed Mirage, which augments VLM decoding with latent visual tokens alongside ordinary text. Concretely, whenever the model chooses to "think visually", it recasts its hidden states as next tokens, thereby continuing a multimodal trajectory without generating pixel-level images. We first supervise the latent tokens through distillation from ground-truth image embeddings, then switch to text-only supervision so that the latent trajectory aligns tightly with the task objective. A subsequent reinforcement learning stage further enhances the multimodal reasoning capability. Experiments on diverse benchmarks demonstrate that Mirage unlocks stronger multimodal reasoning without explicit image generation.
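
The decoding idea in the abstract, feeding the hidden state back as the next input instead of a sampled token's embedding, can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: `toy_step` stands in for a real decoder, and the vocabulary, the `<latent>` switch token, and the fixed `LATENT_LEN` are all hypothetical choices made for illustration.

```python
import math
import random

HIDDEN = 8                                      # toy hidden size (assumption)
VOCAB = ["<latent>", "the", "piece", "fits", "<eos>"]  # toy vocabulary (assumption)
LATENT_START = 0                                # hypothetical id that switches to latent mode
LATENT_LEN = 3                                  # latent tokens per "visual thought" (assumption)

rng = random.Random(0)
EMBED = [[rng.gauss(0, 1) for _ in range(HIDDEN)] for _ in VOCAB]       # token embeddings
W = [[rng.gauss(0, 0.5) for _ in range(HIDDEN)] for _ in range(HIDDEN)]  # toy weights


def toy_step(state, x):
    """Stand-in for one decoder step: mixes input x into the hidden state."""
    return [math.tanh(s + sum(w * xi for w, xi in zip(row, x)))
            for s, row in zip(state, W)]


def greedy_token(hidden):
    """Pick the token whose embedding has the largest dot product with hidden."""
    scores = [sum(h * e for h, e in zip(hidden, emb)) for emb in EMBED]
    return max(range(len(VOCAB)), key=scores.__getitem__)


def decode(prompt_ids, max_steps=10):
    """Return an interleaved trajectory of ("text", str) and ("latent", vector)."""
    state = [0.0] * HIDDEN
    for tok in prompt_ids[:-1]:                 # consume the prompt as ordinary text
        state = toy_step(state, EMBED[tok])
    x = EMBED[prompt_ids[-1]]
    latent_left = LATENT_LEN if prompt_ids[-1] == LATENT_START else 0

    trajectory = []
    for _ in range(max_steps):
        state = toy_step(state, x)
        if latent_left > 0:
            # Latent mode: no token is sampled and no image is rendered;
            # the hidden state itself becomes the next input.
            trajectory.append(("latent", list(state)))
            x = state
            latent_left -= 1
            continue
        tok = greedy_token(state)               # text mode: ordinary decoding
        trajectory.append(("text", VOCAB[tok]))
        if VOCAB[tok] == "<eos>":
            break
        if tok == LATENT_START:
            latent_left = LATENT_LEN
        x = EMBED[tok]
    return trajectory


if __name__ == "__main__":
    for kind, value in decode([1, LATENT_START], max_steps=6):
        print(kind, value if kind == "text" else [round(v, 2) for v in value])
```

The only difference between the two branches is what is fed forward: a looked-up embedding in text mode, the raw hidden state in latent mode. How the actual model decides when to enter and leave latent mode is not specified in the abstract.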

Zeyuan Yang, Xueyang Yu, Delin Chen, Maohao Shen, Chuang Gan • 2025

Related benchmarks

Task	Dataset	Result
Multimodal Reasoning	WeMath	Accuracy 16.7	199
Multimodal Reasoning	LogicVista	Accuracy 40.7	172
Multimodal Reasoning	MathVision	Accuracy 28.6	162
Multimodal Reasoning	MathVerse	Accuracy 27.3	138
Multimodal Reasoning	MathVista	Accuracy 63.7	89
Vision Understanding	MMVP	Accuracy 68.33	45
Visual Reasoning	Jigsaw	Accuracy 70	44
Visual Reasoning	VSP	Accuracy 76	17
Visual Spatial Planning	VSP (test)	Average Accuracy 76	17
Visual Reasoning	VisuLogic	Avg Score 26.6	16

Showing 10 of 25 rows
