
Visual Planning: Let's Think Only with Images

About

Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely on pure text as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometrical information. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations for these "vision-first" tasks, as a supplementary channel to language-based reasoning. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. We introduce a novel reinforcement learning framework, Visual Planning via Reinforcement Learning (VPRL), empowered by GRPO for post-training large vision models, leading to substantial improvements in planning in a selection of representative visual navigation tasks, FrozenLake, Maze, and MiniBehavior. Our visual planning paradigm outperforms all other planning variants that conduct reasoning in the text-only space. Our results establish Visual Planning as a viable and promising supplement to language-based reasoning, opening new avenues for tasks that benefit from intuitive, image-based inference.
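The abstract names GRPO as the optimizer behind VPRL but gives no detail. As a rough illustration only, the sketch below shows the group-relative advantage that GRPO-style methods generally compute: several candidates are sampled for the same input, each is scored, and each reward is normalized against the mean and standard deviation of its own group. The function name, the reward values, the `eps` constant, and the use of the population standard deviation are assumptions made for this example and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (GRPO-style).

    `rewards` holds one scalar per candidate sampled for the same input.
    `eps` (an assumed constant) avoids division by zero when all rewards tie.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four candidate next-state images from one input.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group in which every candidate earns the same reward carries no signal.
tied = group_relative_advantages([0.5, 0.5, 0.5])
```

Candidates that score above their group mean receive a positive advantage and those below receive a negative one, so no separate learned value function is needed to produce a baseline.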

Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, Ivan Vulić • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Visual Planning | FrozenLake | EM (%) | 91.6 | 8
Visual Planning | Maze | EM | 74.5 | 8
Visual Planning | MINIBEHAVIOR | EM | 7.58e+3 | 8
Visual Planning | AVG. (FROZENLAKE, MAZE, MINIBEHAVIOR) | EM | 0.806 | 8
Maze Navigation | MAZENAVIGATION 3x3 - 6x6 1.0 (In Distribution) | EM | 73.5 | 7
Maze Navigation | MAZENAVIGATION OOD Path Length 6x6 Long 1.0 | Exact Match | 200 | 7
Maze Navigation | MAZENAVIGATION OOD Maze Sizes 7x7 1.0 | EM | 14 | 7
Maze Navigation | MAZENAVIGATION OOD Maze Sizes 8x8 1.0 | EM | 4 | 7
Maze Navigation | MAZENAVIGATION OOD Path Length 5x5 Long 1.0 | EM | 0.00e+0 | 7
Maze Navigation | MAZENAVIGATION 7x7 Long OOD Both 1.0 | EM | 0.00e+0 | 7
Showing 10 of 11 rows
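The table reports EM (exact match) without defining it. A minimal sketch of one common reading follows, assuming that a prediction counts as correct only when its whole action sequence equals the reference sequence and that the score is reported as a percentage. The function name, the action labels, and the sample plans are hypothetical, and the benchmark's own definition may differ.

```python
def exact_match_rate(predicted, reference):
    """Return the percentage of predictions identical to their references.

    Each item is a full plan, here a list of action labels. A plan counts
    as a hit only if every step matches, in order and in length.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not predicted:
        return 0.0
    hits = sum(1 for p, r in zip(predicted, reference) if p == r)
    return 100.0 * hits / len(predicted)


# Hypothetical plans for three navigation episodes.
reference = [["up", "right"], ["down"], ["left", "left", "up"]]
predicted = [["up", "right"], ["down"], ["left", "up", "left"]]

score = exact_match_rate(predicted, reference)
```

Under this reading, the third plan scores nothing even though it contains the same moves, because the order differs. This all-or-nothing behavior is one reason EM can fall sharply on longer paths.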
