Multi-Modal Manipulation via Multi-Modal Policy Consensus
About
Effectively integrating diverse sensory modalities is crucial for robotic manipulation. However, the typical approach of feature concatenation is often suboptimal: dominant modalities such as vision can overwhelm sparse but critical signals like touch in contact-rich tasks, and monolithic architectures cannot flexibly incorporate new or missing modalities without retraining. Our method factorizes the policy into a set of diffusion models, each specialized for a single representation (e.g., vision or touch), and employs a router network that learns consensus weights to adaptively combine their contributions, enabling incremental incorporation of new representations. We evaluate our approach on simulated manipulation tasks in RLBench, as well as real-world tasks such as occluded object picking, in-hand spoon reorientation, and puzzle insertion, where it significantly outperforms feature-concatenation baselines on scenarios requiring multimodal reasoning. Our policy also demonstrates robustness to physical perturbations and sensor corruption. Finally, we conduct a perturbation-based importance analysis, which reveals adaptive shifts between modalities.
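The consensus step described above can be illustrated with a minimal sketch. It assumes that each modality-specific diffusion model outputs a noise estimate of the same shape, that the router produces one scalar logit per modality, and that the estimates are combined as a softmax-weighted sum. The abstract does not specify the combination rule, so these choices, along with the names `consensus_noise`, `router_logits`, and `available`, are hypothetical and not the authors' implementation.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def consensus_noise(noise_preds, router_logits, available=None):
    """Combine per-modality noise estimates with router weights.

    noise_preds:   dict mapping modality name -> noise estimate (np.ndarray)
    router_logits: dict mapping modality name -> scalar logit (assumed router output)
    available:     optional set of modality names; modalities outside it are
                   dropped and the remaining weights are renormalized
                   (an assumption about how missing modalities are handled)
    """
    names = [n for n in noise_preds if available is None or n in available]
    if not names:
        raise ValueError("at least one modality must be available")
    logits = np.array([router_logits[n] for n in names], dtype=float)
    weights = softmax(logits)
    # Weighted sum of the per-modality noise estimates.
    combined = sum(w * noise_preds[n] for w, n in zip(weights, names))
    return combined, dict(zip(names, weights))


# Toy inputs: two modalities with equal router logits.
preds = {"vision": np.ones(2), "touch": np.zeros(2)}
logits = {"vision": 0.0, "touch": 0.0}
combined, weights = consensus_noise(preds, logits)
touch_only, touch_weights = consensus_noise(preds, logits, available={"touch"})
```

Under these assumptions, dropping a modality only changes which terms enter the weighted sum, which is one way a factorized policy could tolerate a missing sensor without retraining the remaining models.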
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Vase Wiping | Vase Wiping 30 Demos Flexiv Rizon4 Single-arm 1.0 (test) | Task Score | 40.5 | 13 |
| Chip Handover | Chip Handover 50 Demos Bi-Arx5 Dual-arm 1.0 (test) | Success Rate | 15 | 13 |
| Multi-task Performance Aggregation | Combined Five Tasks (Shoe Lacing, Chip Handover, Cucum. Peeling, Vase Wiping, Lock Opening) 1.0 (average) | Average Performance | 24.7 | 13 |
| Shoe Lacing | Shoe Lacing 100 Demos, Bi-Arx5 Dual-arm 1.0 (test) | Success Rate | 0 | 13 |
| Cucumber Peeling | Cucumber Peeling 50 Demos, Bi-Arx5 Dual-arm 1.0 (test) | Task Score | 63 | 13 |
| Lock Opening | Lock Opening 20 Demos Flexiv Rizon4 Single-arm 1.0 (test) | Success Rate | 5 | 13 |
| Robotic Manipulation | Weight-Based Bottle Placement | Success Rate | 15 | 7 |
| Robotic Manipulation | Manipulation Task Suite Bottle, Connector, Lid | Average Success Rate | 54 | 7 |
| Robotic Manipulation | Twisty Connector Pull Out | Success Rate | 1 | 7 |
| Robotic Manipulation | Egg Boiler Lid Opening | Success Rate | 0.55 | 7 |