VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models

About

Vision-Language-Action (VLA) models typically map visual observations and linguistic instructions directly to control signals. This "black-box" mapping forces a single forward pass to simultaneously handle instruction interpretation, spatial grounding, and low-level control, often leading to poor spatial precision and limited robustness in out-of-distribution scenarios. To address these limitations, we propose VP-VLA, a dual-system framework that decouples high-level reasoning and low-level execution via a structured visual prompting interface. Specifically, a "System 2 Planner" decomposes complex instructions into sub-tasks and identifies relevant target objects and goal locations. These spatial anchors are rendered directly within the native RGB observation space as modality-consistent visual prompts, such as crosshairs and bounding boxes. This avoids the modality mismatch introduced by dense masks, affordance maps, or additional control-specific representations. Guided by these prompts and enhanced by a novel auxiliary visual grounding objective during training, a "System 1 Controller" reliably generates precise low-level motions. Extensive experiments in simulation and in the real world demonstrate that VP-VLA surpasses state-of-the-art end-to-end baselines, including QwenOFT and GR00T-N1.6. Project page: https://visualprompt-vla.github.io/

Zixuan Wang, Yuxin Chen, Yuqi Liu, Jinhui Ye, Pengguang Chen, Changsheng Lu, Shu Liu, Bei Yu, Jiaya Jia • 2026
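
The abstract describes rendering spatial anchors into the RGB observation as crosshairs and bounding boxes. The following is a minimal sketch of that idea in Python with NumPy, not the authors' implementation: the function names `draw_crosshair` and `draw_box`, the colors, the one-pixel line width, and the clamping behavior at image borders are all assumptions made for illustration.

```python
import numpy as np


def draw_crosshair(image, center, half_len=6, color=(255, 0, 0)):
    """Return a copy of an HxWx3 uint8 image with a 1-px crosshair at center=(x, y).

    Hypothetical helper: marker size and color are illustrative choices.
    """
    out = image.copy()
    h, w = out.shape[:2]
    x, y = center
    if not (0 <= x < w and 0 <= y < h):
        return out  # anchor lies outside the frame: nothing to draw
    # Clamp both arms of the crosshair to the image bounds.
    x0, x1 = max(x - half_len, 0), min(x + half_len, w - 1)
    y0, y1 = max(y - half_len, 0), min(y + half_len, h - 1)
    out[y, x0:x1 + 1] = color   # horizontal arm
    out[y0:y1 + 1, x] = color   # vertical arm
    return out


def draw_box(image, box, color=(0, 255, 0)):
    """Return a copy with a 1-px outline for box=(x_min, y_min, x_max, y_max), inclusive.

    Hypothetical helper: boxes that extend past the frame are clamped to its border.
    """
    out = image.copy()
    h, w = out.shape[:2]
    x0, y0, x1, y1 = box
    x0, x1 = max(x0, 0), min(x1, w - 1)
    y0, y1 = max(y0, 0), min(y1, h - 1)
    if x0 > x1 or y0 > y1:
        return out  # box is entirely outside the frame
    out[y0, x0:x1 + 1] = color  # top edge
    out[y1, x0:x1 + 1] = color  # bottom edge
    out[y0:y1 + 1, x0] = color  # left edge
    out[y0:y1 + 1, x1] = color  # right edge
    return out


# Usage: mark a target object with a crosshair and a goal region with a box.
frame = np.zeros((64, 64, 3), dtype=np.uint8)  # stand-in for a camera frame
prompted = draw_box(draw_crosshair(frame, (20, 30)), (40, 10, 55, 25))
```

Because the prompts are written into the pixel array itself, the output keeps the shape and dtype of the input frame, so a downstream policy could consume it without a separate mask or affordance channel.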

Related benchmarks

Task	Dataset	Metric	Result	Rank
Robot Manipulation	SimplerEnv WidowX	Overall Success Rate	58.3	123
Robotic Manipulation	RoboCasa GR1 Tabletop	Average Success Rate	53.8	24

