
Making Sense of Vision and Touch: Learning Multimodal Representations for Contact-Rich Tasks

About

Contact-rich manipulation tasks in unstructured environments often require both haptic and visual feedback. However, it is non-trivial to manually design a robot controller that combines these modalities, which have very different characteristics. While deep reinforcement learning has shown success in learning control policies for high-dimensional inputs, these algorithms are generally intractable to deploy on real robots due to their sample complexity. In this work, we use self-supervision to learn a compact and multimodal representation of our sensory inputs, which can then be used to improve the sample efficiency of our policy learning. Evaluating our method on a peg insertion task, we show that it generalizes over varying geometries, configurations, and clearances, while remaining robust to external perturbations. We also systematically study different self-supervised learning objectives and representation learning architectures. Results are presented in simulation and on a physical robot.
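To make the abstract's pipeline concrete, here is a minimal sketch of the idea: encode vision and touch into one compact latent vector, train that encoder with self-supervised heads (no task reward), and feed the resulting latent to a small policy network. This is not the authors' released implementation; the class name `MultimodalEncoder`, the layer sizes, the input shapes, and the two heads (next-step contact prediction and cross-modal temporal alignment, two of the self-supervised signals the paper discusses) are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): fuse vision and touch into a
# compact latent with self-supervised heads, then give that latent to an
# RL policy. All sizes and shapes below are illustrative assumptions.
import torch
import torch.nn as nn

class MultimodalEncoder(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        # Vision branch: encodes an RGB frame (assumed 3x64x64).
        self.vision = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(latent_dim),
        )
        # Touch branch: encodes a short window of force/torque readings
        # (assumed 32 timesteps x 6 axes, flattened).
        self.touch = nn.Sequential(
            nn.Linear(32 * 6, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Fusion: concatenate branch features, project to the shared latent.
        self.fuse = nn.Linear(2 * latent_dim, latent_dim)
        # Hypothetical self-supervised heads; trained from labels that come
        # for free from interaction, not from task reward.
        self.contact_head = nn.Linear(latent_dim, 1)  # contact next step?
        self.align_head = nn.Linear(latent_dim, 1)    # streams time-aligned?

    def forward(self, rgb, force_torque):
        z = self.fuse(torch.cat(
            [self.vision(rgb), self.touch(force_torque.flatten(1))], dim=-1))
        return z, self.contact_head(z), self.align_head(z)

# The compact latent z becomes the low-dimensional policy input; training
# the policy on z rather than raw pixels is what improves sample efficiency.
encoder = MultimodalEncoder()
policy = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 7))

rgb = torch.randn(4, 3, 64, 64)   # batch of camera frames
ft = torch.randn(4, 32, 6)        # batch of force/torque windows
z, contact_logit, align_logit = encoder(rgb, ft)
action = policy(z.detach())       # policy consumes the learned representation
```

In the paper's setup, the representation is first trained with such self-supervised objectives and then held fixed while the policy is learned, which is the step that sidesteps the sample complexity of end-to-end deep RL on raw sensory inputs.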

Michelle A. Lee, Yuke Zhu, Peter Zachares, Matthew Tan, Krishnan Srinivasan, Silvio Savarese, Li Fei-Fei, Animesh Garg, Jeannette Bohg • 2019

Related benchmarks

Task          | Dataset                      | Metric                 | Result | Rank
Insertion     | Simulation                   | Insertion Success Rate | 11.4   | 14
Dual Arm Lift | Simulation                   | Success Rate           | 92.6   | 7
Lift          | Simulation                   | Success Rate           | 71.9   | 7
Lift          | Simulation (Capsule Shape)   | Success Rate           | 58.1   | 7
Block Spin    | Simulation                   | Success Rate           | 20.6   | 7
Egg Rotate    | Simulation                   | Success Rate           | 0.9    | 7
Lift          | Simulation (Cylinder Shape)  | Success Rate           | 52.6   | 7
Block Rotate  | Simulation                   | Success Rate           | 0.8    | 7
Door          | Simulation                   | Success Rate           | 0.982  | 7
Insertion     | Simulation (Noisy)           | Success Rate           | 0.207  | 7

Showing 10 of 12 rows.
