
Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation

About

Transformers have revolutionized vision and natural language processing with their ability to scale with large datasets. But in robotic manipulation, data is both limited and expensive. Can manipulation still benefit from Transformers with the right problem formulation? We investigate this question with PerAct, a language-conditioned behavior-cloning agent for multi-task 6-DoF manipulation. PerAct encodes language goals and RGB-D voxel observations with a Perceiver Transformer, and outputs discretized actions by "detecting the next best voxel action". Unlike frameworks that operate on 2D images, the voxelized 3D observation and action space provides a strong structural prior for efficiently learning 6-DoF actions. With this formulation, we train a single multi-task Transformer for 18 RLBench tasks (with 249 variations) and 7 real-world tasks (with 18 variations) from just a few demonstrations per task. Our results show that PerAct significantly outperforms unstructured image-to-action agents and 3D ConvNet baselines for a wide range of tabletop tasks.

Mohit Shridhar, Lucas Manuelli, Dieter Fox • 2022
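
The voxelized observation and action space described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the workspace bounds, the grid resolution, and the function names `voxelize` and `best_voxel_to_position` are assumptions chosen for illustration, and a plain score volume stands in for the Perceiver Transformer's output. Only the translation part of an action is shown; rotation and gripper state are left out.

```python
import numpy as np

# Hypothetical workspace bounds (metres) and grid resolution; illustrative values only.
BOUNDS_MIN = np.array([-0.5, -0.5, 0.0])
BOUNDS_MAX = np.array([0.5, 0.5, 1.0])
GRID = 100  # voxels per axis


def voxelize(points, bounds_min=BOUNDS_MIN, bounds_max=BOUNDS_MAX, grid=GRID):
    """Turn an (N, 3) point cloud into a boolean occupancy grid."""
    voxel_size = (bounds_max - bounds_min) / grid
    idx = np.floor((points - bounds_min) / voxel_size).astype(int)
    # Keep only points that fall inside the workspace.
    inside = np.all((idx >= 0) & (idx < grid), axis=1)
    occupancy = np.zeros((grid, grid, grid), dtype=bool)
    i, j, k = idx[inside].T
    occupancy[i, j, k] = True
    return occupancy


def best_voxel_to_position(scores, bounds_min=BOUNDS_MIN, bounds_max=BOUNDS_MAX):
    """Pick the highest-scoring voxel and return its index and centre in metres."""
    shape = np.array(scores.shape)
    voxel_size = (bounds_max - bounds_min) / shape
    idx = np.array(np.unravel_index(np.argmax(scores), scores.shape))
    position = bounds_min + (idx + 0.5) * voxel_size
    return idx, position


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cloud = rng.uniform(BOUNDS_MIN, BOUNDS_MAX, size=(1000, 3))
    occupancy = voxelize(cloud)
    # Random scores stand in for a learned per-voxel prediction.
    scores = rng.random(occupancy.shape)
    voxel, target = best_voxel_to_position(scores)
    print("occupied voxels:", int(occupancy.sum()))
    print("selected voxel:", voxel, "target position (m):", target)
```

Treating the next translation as a classification over voxels is what makes the action "discretized": the predicted position is only as precise as one voxel, which is the trade-off for the structural prior the abstract mentions.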

Related benchmarks

Task | Dataset | Result | Rank
Robotic Manipulation | RLBench | Place Cups Success: 2.4 | 63
Robotic Manipulation | RLBench (test) | Average Success Rate: 49.4 | 49
Peg Insertion | Real-world | Success Rate: 63.5 | 25
Bimanual Manipulation | RLBench 2 | Push Box Success Rate: 66.3 | 20
Robotic Manipulation | COLOSSEUM | Avg SR: 27.9 | 20
Bimanual Robot Manipulation | TWIN 1.0 (test) | Push Box Success Rate: 57 | 18
Multi-task Robotic Manipulation | RLBench | Avg Success Rate: 52.3 | 16
Pick-&-Place | Real-world | Success Rate: 82.1 | 15
Tool Usage | Real-world tool usage | Success Rate: 48.7 | 13
Robotic Manipulation | RLBench standard (test) | Reach Target Success Rate: 100 | 12
Showing 10 of 52 rows
