Play to Generalize: Learning to Reason Through Game Play

About

Developing reasoning capabilities in multimodal large language models (MLLMs) remains challenging. Motivated by literature suggesting that gameplay promotes transferable reasoning skills, we propose a novel post-training method, Visual Game Learning (ViGaL), where MLLMs develop generalizable reasoning skills through playing arcade-like games. Specifically, we show that training a 7B-parameter MLLM via reinforcement learning (RL) on simple games like Snake significantly enhances the downstream performance on multimodal math benchmarks like MathVista, on multi-discipline questions like MMMU and on 3D spatial reasoning benchmarks like VSI-Bench, without seeing any worked solutions, equations, or diagrams during RL. Remarkably, our model outperforms specialist models post-trained on benchmark-oriented multimodal reasoning data, while preserving the model's performance on general visual benchmarks, a challenge where specialist models often fall short. Our findings suggest that multimodal reasoning can emerge from gameplay, pointing to a promising strategy of designing surrogate tasks for RL post-training.
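To make the idea of a rule-based game reward concrete, here is a minimal sketch of a Snake-like environment step. It is an illustration only, not the authors' environment: the names `SnakeState`, `MOVES`, and `step`, the board layout, and the reward values (+1 for the apple, -1 for a collision, 0 otherwise) are all assumptions, since the abstract does not specify them.

```python
from dataclasses import dataclass

# Hypothetical toy environment; none of these names come from the paper.
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


@dataclass
class SnakeState:
    size: int     # the board is size x size
    body: list    # list of (x, y) cells, head first
    apple: tuple  # (x, y) position of the apple


def step(state, move):
    """Apply one move and return (next_state, reward, done).

    Assumed reward scheme: +1 for reaching the apple, -1 for hitting a
    wall or the snake's own body, 0 for any other legal move.
    """
    dx, dy = MOVES[move]
    hx, hy = state.body[0]
    head = (hx + dx, hy + dy)

    hit_wall = not (0 <= head[0] < state.size and 0 <= head[1] < state.size)
    ate = head == state.apple
    # The tail cell is vacated this turn unless the snake grows.
    occupied = state.body if ate else state.body[:-1]
    if hit_wall or head in occupied:
        return state, -1.0, True

    new_body = [head] + occupied
    # Simplification: the episode ends once the apple is eaten, so no
    # random apple respawn is needed in this sketch.
    reward = 1.0 if ate else 0.0
    return SnakeState(state.size, new_body, state.apple), reward, ate
```

The reward here is computed entirely from the game rules, which is what would let an RL loop score a model's chosen move without any worked solutions or human labels.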

Yunfei Xie, Yinsong Ma, Shiyi Lan, Alan Yuille, Junfei Xiao, Chen Wei • 2025

Related benchmarks

Task	Dataset	Result
Visual Reasoning	BLINK	Accuracy 55.6	107
Vision-centric Reasoning	RealworldQA	Accuracy 66.5	66
Vision Understanding	MMVP	Accuracy 74.6	36
Vision-centric Reasoning	MMVP	Accuracy 74.6	21
Visual Understanding	BLINK	Accuracy 55.6	21
Visual Understanding	MMStar	Accuracy (Clean) 62.6	16
Chart Understanding	ChartXiv-RQ	Accuracy 41.8	16
Chart Understanding	ReachQA	--	16
Reasoning and Math	VLMEvalKit (test)	MathVista Accuracy 71.9	13
Vision-centric Reasoning	MuirBench	Accuracy 57.8	11

Showing 10 of 15 rows
