The Power of the Senses: Generalizable Manipulation from Vision and Touch through Masked Multimodal Learning

About

Humans rely on the synergy of their senses for most essential tasks. For tasks requiring object manipulation, we seamlessly and effectively exploit the complementarity of our senses of vision and touch. This paper draws inspiration from such capabilities and aims to find a systematic approach to fuse visual and tactile information in a reinforcement learning setting. We propose Masked Multimodal Learning (M3L), which jointly learns a policy and visual-tactile representations based on masked autoencoding. The representations jointly learned from vision and touch improve sample efficiency, and unlock generalization capabilities beyond those achievable through each of the senses separately. Remarkably, representations learned in a multimodal setting also benefit vision-only policies at test time. We evaluate M3L on three simulated environments with both visual and tactile observations: robotic insertion, door opening, and dexterous in-hand manipulation, demonstrating the benefits of learning a multimodal policy. Code and videos of the experiments are available at https://sferrazza.cc/m3l_site.
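The abstract describes masked autoencoding over joint visual-tactile inputs only at a high level. As a rough illustration of the masking step, the sketch below splits a camera image and a tactile image into patches, concatenates the patch tokens from both modalities, and randomly hides a fraction of them. It is not the authors' implementation: the function names, the patch size, the 75% mask ratio, and the image shapes are all assumptions chosen for illustration, and the encoder, decoder, reconstruction loss, and policy are omitted.

```python
import numpy as np


def patchify(x, patch):
    """Split an (H, W, C) array into flattened, non-overlapping patches."""
    h, w, c = x.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    x = x.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/p, W/p, p, p, C)
    return x.reshape(-1, patch * patch * c)


def mask_multimodal_tokens(vision, touch, patch=8, mask_ratio=0.75, seed=0):
    """Hypothetical helper: jointly mask vision and touch patch tokens.

    Returns the visible tokens, a boolean mask over all tokens
    (True = hidden), and the indices of the visible tokens.
    """
    rng = np.random.default_rng(seed)
    # Tokens from both modalities share one sequence, so masking is
    # sampled across vision and touch together (an assumption here).
    tokens = np.concatenate(
        [patchify(vision, patch), patchify(touch, patch)], axis=0
    )
    n = tokens.shape[0]
    n_keep = int(round(n * (1 - mask_ratio)))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False
    return tokens[keep_idx], mask, keep_idx


# Example with assumed shapes: a 64x64 RGB camera frame and a 32x32
# three-channel tactile image.
vision = np.zeros((64, 64, 3))
touch = np.ones((32, 32, 3))
visible, mask, keep_idx = mask_multimodal_tokens(vision, touch)
print(visible.shape, int(mask.sum()))
```

In a masked autoencoder, only the visible tokens would be passed to the encoder, and a decoder would be trained to reconstruct the hidden patches. Sampling the mask over the combined sequence means a hidden visual patch may have to be inferred partly from touch and vice versa, which is one plausible reading of how joint representations could arise.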

Carmelo Sferrazza, Younggyo Seo, Hao Liu, Youngwoon Lee, Pieter Abbeel • 2023

Related benchmarks

Task	Dataset	Metric	Result
Insertion	Simulation	Insertion Success Rate	72.1	14
Block Rotate	Simulation	Success Rate	11.6	7
Block Spin	Simulation	Success Rate	30.8	7
Door	Simulation	Success Rate	1	7
Dual Arm Lift	Simulation	Success Rate	88.2	7
Egg Rotate	Simulation	Success Rate	4.2	7
Pen Rotate	Simulation	Success Rate	73.1	7
Insertion	Simulation Noisy	Success Rate	0.473	7
Mobile Catch	Simulation	Success Rate	15.8	7
Lift	Simulation Capsule Shape	Success Rate	54.2	7

Showing 10 of 12 rows
