TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies
About
Although large vision-language-action (VLA) models pretrained on extensive robot datasets offer promising generalist policies for robotic learning, they still struggle with spatial-temporal dynamics in interactive robotics, making them less effective in handling complex tasks, such as manipulation. In this work, we introduce visual trace prompting, a simple yet effective approach to facilitate VLA models' spatial-temporal awareness for action prediction by encoding state-action trajectories visually. We develop a new TraceVLA model by finetuning OpenVLA on our own collected dataset of 150K robot manipulation trajectories using visual trace prompting. Evaluations of TraceVLA across 137 configurations in SimplerEnv and 4 tasks on a physical WidowX robot demonstrate state-of-the-art performance, outperforming OpenVLA by 10% on SimplerEnv and 3.5x on real-robot tasks and exhibiting robust generalization across diverse embodiments and scenarios. To further validate the effectiveness and generality of our method, we present a compact VLA model based on the 4B Phi-3-Vision, pretrained on Open-X-Embodiment and finetuned on our dataset, which rivals the 7B OpenVLA baseline while significantly improving inference efficiency.
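To make the idea of "encoding state-action trajectories visually" concrete, the following is a minimal sketch of overlaying a 2D point trace onto an image frame so that past motion becomes part of the visual input. It is an illustration only: the abstract does not specify how traces are extracted or rendered, so the function names (`draw_line`, `overlay_trace`), the grayscale list-of-lists image representation, and the use of Bresenham line rasterization are all assumptions made here, not the authors' implementation.

```python
def draw_line(image, p0, p1, value):
    """Rasterize the segment p0 -> p1 (points are (x, y)) with Bresenham's algorithm."""
    (x0, y0), (x1, y1) = p0, p1
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    height, width = len(image), len(image[0])
    while True:
        # Skip pixels that fall outside the frame instead of raising.
        if 0 <= x0 < width and 0 <= y0 < height:
            image[y0][x0] = value
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy


def overlay_trace(image, trace, value=255):
    """Return a copy of `image` with the polyline through `trace` drawn on it.

    `image` is a list of rows of pixel intensities; `trace` is a list of
    (x, y) pixel coordinates, oldest first (hypothetical format).
    """
    out = [row[:] for row in image]  # leave the original frame untouched
    if len(trace) == 1:
        draw_line(out, trace[0], trace[0], value)
    for p0, p1 in zip(trace, trace[1:]):
        draw_line(out, p0, p1, value)
    return out


if __name__ == "__main__":
    frame = [[0] * 5 for _ in range(5)]
    prompted = overlay_trace(frame, [(0, 0), (4, 0), (4, 4)])
    for row in prompted:
        print(row)
```

In a setup like the one the abstract describes, a frame annotated this way would be given to the model alongside the language instruction; how the trace is obtained and styled is left open here.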
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Robot Manipulation | LIBERO | Goal Achievement | 75.1 | 494 |
| Robot Manipulation | LIBERO (test) | Average Success Rate | 74.8 | 142 |
| Robot Manipulation | SimplerEnv WidowX Robot tasks (test) | Success Rate (Spoon) | 12.5 | 79 |
| Robot Manipulation | SimplerEnv Google Robot tasks Visual Matching | Pick Coke Can Success Rate | 28 | 62 |
| Robot Manipulation | SimplerEnv Google Robot tasks Variant Aggregation | Pick Coke Can Success Rate | 60 | 44 |
| Drawer Opening | SimplerEnv Google Robot embodiment (test) | Success Rate | 63.1 | 28 |
| Move Near | SimplerEnv Google Robot embodiment | Success Rate | 63.8 | 28 |
| Pick Can | SimplerEnv Google Robot embodiment | Success Rate | 64.3 | 28 |
| Robotic Manipulation | LIBERO v1 (test) | Config 10 Score | 54.1 | 27 |
| General Robot Manipulation | SimplerEnv | Average Success Rate | 38.6 | 23 |