Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks
About
Contact-rich manipulation tasks in unstructured environments often require both haptic and visual feedback. However, it is non-trivial to manually design a robot controller that combines modalities with very different characteristics. While deep reinforcement learning has shown success in learning control policies for high-dimensional inputs, these algorithms are generally impractical to deploy on real robots because of their sample complexity. We use self-supervision to learn a compact multimodal representation of our sensory inputs, which can then be used to improve the sample efficiency of our policy learning. We evaluate our method on a peg insertion task, generalizing over different peg geometries, configurations, and clearances, while being robust to external perturbations. We present results from both simulated and real robot experiments.
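To make the idea of a compact multimodal representation concrete, below is a minimal PyTorch sketch of a fusion encoder that maps an RGB image, a short window of force/torque readings, and proprioceptive state into a single latent vector that a policy could consume. All module sizes, input dimensions, and names (`MultimodalEncoder`, the 128-d latent, the 32-step force window) are illustrative assumptions, not the architecture or self-supervised objectives used in the paper.

```python
# Hypothetical sketch of a multimodal fusion encoder; dimensions and names
# are assumptions for illustration, not the paper's exact architecture.
import torch
import torch.nn as nn


class MultimodalEncoder(nn.Module):
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        # Image branch: small CNN over a 3x128x128 RGB frame.
        self.image_enc = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 128),
        )
        # Force/torque branch: MLP over a flattened window of 32 readings x 6 axes.
        self.force_enc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 6, 64), nn.ReLU(),
            nn.Linear(64, 64),
        )
        # Proprioception branch: MLP over end-effector pose + velocity (assumed 14-d).
        self.proprio_enc = nn.Sequential(
            nn.Linear(14, 32), nn.ReLU(),
            nn.Linear(32, 32),
        )
        # Fusion head: concatenate per-modality features, project to the latent.
        self.fusion = nn.Sequential(
            nn.Linear(128 + 64 + 32, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, image, force, proprio):
        feats = torch.cat(
            [self.image_enc(image), self.force_enc(force), self.proprio_enc(proprio)],
            dim=-1,
        )
        return self.fusion(feats)


if __name__ == "__main__":
    enc = MultimodalEncoder()
    z = enc(
        torch.randn(4, 3, 128, 128),  # batch of RGB frames
        torch.randn(4, 32, 6),        # batch of force/torque windows
        torch.randn(4, 14),           # batch of proprioceptive states
    )
    print(z.shape)  # torch.Size([4, 128])
```

In the self-supervised setting described in the abstract, an encoder of this kind would be trained with auxiliary prediction objectives rather than task reward, and the resulting latent would then serve as the (much lower-dimensional) observation for reinforcement learning, which is where the sample-efficiency gain comes from.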
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Insertion | Simulation | Insertion Success Rate | 19.3 | 14 |
| Lift | Simulation (Capsule Shape) | Success Rate | 87.5 | 7 |
| Lift | Simulation | Success Rate | 76.7 | 7 |
| Lift | Simulation (Cylinder Shape) | Success Rate | 75.8 | 7 |
| Block Rotate | Simulation | Success Rate | 4.4 | 7 |
| Door | Simulation | Success Rate | 1 | 7 |
| Pen Rotate | Simulation | Success Rate | 2.9 | 7 |
| Block Spin | Simulation | Success Rate | 15.7 | 7 |
| Egg Rotate | Simulation | Success Rate | 0.7 | 7 |
| Insertion | Simulation (Noisy) | Success Rate | 0.269 | 7 |