ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

About

Vision-language-action (VLA) reasoning tasks require agents to interpret multimodal instructions, perform long-horizon planning, and act adaptively in dynamic environments. Existing approaches typically train VLA models in an end-to-end fashion, directly mapping inputs to actions without explicit reasoning, which hinders their ability to plan over multiple steps or adapt to complex task variations. In this paper, we propose ThinkAct, a dual-system framework that bridges high-level reasoning with low-level action execution via reinforced visual latent planning. ThinkAct trains a multimodal LLM to generate embodied reasoning plans guided by reinforcing action-aligned visual rewards based on goal completion and trajectory consistency. These reasoning plans are compressed into a visual plan latent that conditions a downstream action model for robust action execution on target environments. Extensive experiments on embodied reasoning and robot manipulation benchmarks demonstrate that ThinkAct enables few-shot adaptation, long-horizon planning, and self-correction behaviors in complex embodied AI tasks.
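The abstract names two reward signals, goal completion and trajectory consistency, but does not give their form. The sketch below is only one way such a reward could be put together, not the authors' formulation. It assumes trajectories are equal-length lists of 2D points normalized to [0, 1]. The function names, the squared-distance scoring, and the equal weights are all hypothetical choices made for illustration.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def point_score(p: Point, q: Point) -> float:
    """Agreement in [0, 1]: 1 when the points coincide, 0 once the
    squared distance reaches 1 (assumed scoring, not from the paper)."""
    sq_dist = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return max(0.0, 1.0 - sq_dist)


def goal_reward(pred: Sequence[Point], ref: Sequence[Point]) -> float:
    """Goal completion proxy: how well the start and end points match."""
    return 0.5 * (point_score(pred[0], ref[0]) + point_score(pred[-1], ref[-1]))


def trajectory_reward(pred: Sequence[Point], ref: Sequence[Point]) -> float:
    """Trajectory consistency proxy: mean pointwise agreement.
    Assumes both trajectories have the same, non-zero length."""
    if len(pred) != len(ref) or len(pred) == 0:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(point_score(p, q) for p, q in zip(pred, ref)) / len(pred)


def visual_reward(
    pred: Sequence[Point],
    ref: Sequence[Point],
    w_goal: float = 0.5,  # hypothetical weight
    w_traj: float = 0.5,  # hypothetical weight
) -> float:
    """Weighted sum of the two proxies; the weights are illustrative only."""
    return w_goal * goal_reward(pred, ref) + w_traj * trajectory_reward(pred, ref)


if __name__ == "__main__":
    reference = [(0.0, 0.0), (1.0, 0.5)]
    predicted = [(0.0, 0.0), (1.0, 0.0)]
    print(visual_reward(predicted, reference))
```

In a reinforcement learning setup, a scalar like this would score each plan the model generates, so plans whose predicted trajectory ends near the goal and follows the reference path are rewarded more.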

Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, Fu-En Yang • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Robot Manipulation | LIBERO | Goal Achievement: 87.1 | 494 |
| Robot Manipulation | LIBERO (test) | Average Success Rate: 84.4 | 142 |
| Robot Manipulation | SimplerEnv WidowX Robot tasks (test) | Success Rate (Spoon): 58.3 | 79 |
| Robot Manipulation | SimplerEnv Google Robot tasks Visual Matching | Pick Coke Can Success Rate: 92 | 62 |
| Pick Can | SimplerEnv Google Robot embodiment | Success Rate: 92 | 28 |
| Robot Manipulation | SimplerEnv Google Robot Visual Matching | Pick Coke Can: 92 | 28 |
| Move Near | SimplerEnv Google Robot embodiment | Success Rate: 72.4 | 28 |
| Drawer Opening | SimplerEnv Google Robot embodiment (test) | Success Rate: 50 | 28 |
| Robotic Manipulation | SimplerEnv Google Robot - Visual Aggregation | Pick Coke Can: 84 | 28 |
| Robotic Manipulation | LIBERO v1 (test) | Config 10 Score: 70.9 | 27 |
Showing 10 of 34 rows
