
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

About

Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation in the form of reasoning and planning. Despite the progress, most still rely on pre-defined motion primitives to carry out the physical interactions with the environment, which remains a major bottleneck. In this work, we aim to synthesize robot trajectories, i.e., a dense sequence of 6-DoF end-effector waypoints, for a large variety of manipulation tasks given an open-set of instructions and an open-set of objects. We achieve this by first observing that LLMs excel at inferring affordances and constraints given a free-form language instruction. More importantly, by leveraging their code-writing capabilities, they can interact with a vision-language model (VLM) to compose 3D value maps to ground the knowledge into the observation space of the agent. The composed value maps are then used in a model-based planning framework to zero-shot synthesize closed-loop robot trajectories with robustness to dynamic perturbations. We further demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions. We present a large-scale study of the proposed method in both simulated and real-robot environments, showcasing the ability to perform a large variety of everyday manipulation tasks specified in free-form natural language. Videos and code at https://voxposer.github.io

Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, Li Fei-Fei • 2023
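The abstract's core idea, composing value maps over the workspace and then planning through them, can be illustrated with a toy sketch. This is not the authors' implementation. The grid size, the distance-based affordance and avoidance maps, the Dijkstra planner, and every function and parameter name below are assumptions made for illustration. In the paper the maps come from LLM-written code that queries a VLM, whereas here the target and the region to avoid are simply hard-coded voxel coordinates.

```python
import heapq
import itertools
import math

GRID = 10  # hypothetical 10 x 10 x 10 voxel workspace
OFFSETS = [o for o in itertools.product((-1, 0, 1), repeat=3) if any(o)]


def compose_cost_map(target, avoid, radius=2.0, weight=100.0):
    """Sum an affordance map (distance to target) and an avoidance map
    (penalty inside a sphere around `avoid`) into one dense cost map."""
    cost = {}
    for v in itertools.product(range(GRID), repeat=3):
        affordance = math.dist(v, target)
        d = math.dist(v, avoid)
        avoidance = weight * (1.0 - d / radius) if d < radius else 0.0
        cost[v] = affordance + avoidance
    return cost


def plan(cost_map, start, target):
    """Dijkstra over the 26-connected grid; entering a voxel costs its map value."""
    best = {start: 0.0}
    parent = {}
    queue = [(0.0, start)]
    while queue:
        cost_so_far, current = heapq.heappop(queue)
        if current == target:
            break
        if cost_so_far > best[current]:
            continue  # stale queue entry
        for o in OFFSETS:
            nxt = tuple(c + d for c, d in zip(current, o))
            if nxt not in cost_map:
                continue  # outside the workspace
            candidate = cost_so_far + cost_map[nxt]
            if candidate < best.get(nxt, math.inf):
                best[nxt] = candidate
                parent[nxt] = current
                heapq.heappush(queue, (candidate, nxt))
    # Walk parent links back from the target to recover the waypoint sequence.
    path = [target]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1]


start, target, avoid = (0, 0, 2), (8, 8, 2), (4, 4, 2)
path = plan(compose_cost_map(target, avoid), start, target)
print(path)
```

The resulting waypoints run from the start voxel to the target while routing around the penalized voxel. A closed-loop variant would rebuild the map and replan whenever the observed scene changes, which is the kind of robustness to perturbations the abstract describes.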

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Robot Manipulation | Real-world Robot Environment 1.0 (test) | Success Rate | 1.00e+4 | 24 |
| 6-DoF Object Rearrangement | Open6DOR V2 (test) | Position Accuracy | 32.6 | 8 |
| Robotic Manipulation | RLBench | Pick Up Cup Success Rate | 20 | 7 |
| 3D Spatial Reasoning | Open6DOR Position | Level 0 Score | 35.6 | 7 |
| Low-level policy sampling | 5 typical action models (Pick, Place, Open, Close, Toggle) | Success Rate | 22 | 7 |
| 6-DoF Object Rearrangement | Open6DOR Isaac Sim V1 | Position Tracking Error (Level 0) | 35.6 | 6 |
| Robotic Manipulation | 8 Real-world Tasks 20 repetitions (test) | Place Food Success Rate | 70 | 6 |
| Long-horizon manipulation | RLBench | Bridge Between Towers Success Rate | 8 | 5 |
| Stack bowls | Real-world Robotic Tasks | Success Rate | 20 | 4 |
| Non-toppling push | Robotic Manipulation Tasks (real-world) | Success Rate | 0.00e+0 | 4 |
Showing 10 of 24 rows
