Play to Generalize: Learning to Reason Through Game Play
About
Developing reasoning capabilities in multimodal large language models (MLLMs) remains challenging. Motivated by literature suggesting that gameplay promotes transferable reasoning skills, we propose a novel post-training method, Visual Game Learning (ViGaL), where MLLMs develop generalizable reasoning skills through playing arcade-like games. Specifically, we show that training a 7B-parameter MLLM via reinforcement learning (RL) on simple games like Snake significantly enhances the downstream performance on multimodal math benchmarks like MathVista, on multi-discipline questions like MMMU and on 3D spatial reasoning benchmarks like VSI-Bench, without seeing any worked solutions, equations, or diagrams during RL. Remarkably, our model outperforms specialist models post-trained on benchmark-oriented multimodal reasoning data, while preserving the model's performance on general visual benchmarks, a challenge where specialist models often fall short. Our findings suggest that multimodal reasoning can emerge from gameplay, pointing to a promising strategy of designing surrogate tasks for RL post-training.
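To make the idea of a game as a surrogate RL task more concrete, the following is a minimal sketch of a Snake-like environment that turns a chosen move into a scalar reward. It is not the authors' implementation: the abstract does not specify the board size, state encoding, or reward values, so `SnakeState`, `step`, `MOVES`, and the +1 / 0 / -1 reward scheme are all hypothetical choices made for illustration. In an MLLM setting, the model would presumably see a rendered frame and emit a move as text; that part is omitted here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Cell = Tuple[int, int]  # (row, col)

# Hypothetical move vocabulary: (row delta, column delta).
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


@dataclass
class SnakeState:
    size: int               # the board is size x size
    snake: List[Cell]       # occupied cells, head first
    apple: Optional[Cell]   # None once the apple has been eaten


def step(state: SnakeState, move: str):
    """Apply one move and return (next_state, reward, done).

    Assumed reward scheme (illustrative only):
    +1 for eating the apple, -1 for a collision, 0 otherwise.
    """
    dr, dc = MOVES[move]
    head_r, head_c = state.snake[0]
    new_head = (head_r + dr, head_c + dc)

    hit_wall = not (0 <= new_head[0] < state.size
                    and 0 <= new_head[1] < state.size)
    ate = state.apple is not None and new_head == state.apple
    # The tail cell is vacated this turn unless the snake grows.
    body = state.snake if ate else state.snake[:-1]
    hit_self = new_head in body

    if hit_wall or hit_self:
        return state, -1.0, True

    new_snake = [new_head] + body
    # Apple respawning is omitted to keep the sketch short.
    next_apple = None if ate else state.apple
    reward = 1.0 if ate else 0.0
    return SnakeState(state.size, new_snake, next_apple), reward, False
```

A rollout would call `step` repeatedly with the moves chosen by the policy and sum the rewards; an RL algorithm would then update the policy to increase that return. The point of the sketch is only that the training signal comes from game outcomes rather than from worked solutions.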
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Reasoning | BLINK | Accuracy | 55.6 | 76 |
| Vision-centric Reasoning | RealworldQA | Accuracy | 66.5 | 38 |
| Vision Understanding | MMVP | Accuracy | 74.6 | 33 |
| Vision-centric Reasoning | MMVP | Accuracy | 74.6 | 21 |
| Visual Understanding | BLINK | Accuracy | 55.6 | 21 |
| Visual Understanding | MMStar | Accuracy (Clean) | 62.6 | 16 |
| Chart Understanding | ChartXiv-RQ | Accuracy | 41.8 | 16 |
| Chart Understanding | ReachQA | -- | -- | 16 |
| Reasoning and Math | VLMEvalKit (test) | MathVista Accuracy | 71.9 | 13 |
| Vision-centric Reasoning | MuirBench | Accuracy | 57.8 | 11 |