MOTIF: Learning Action Motifs for Few-shot Cross-Embodiment Transfer
About
While vision-language-action (VLA) models have advanced generalist robotic learning, cross-embodiment transfer remains challenging due to kinematic heterogeneity and the high cost of collecting sufficient real-world demonstrations to support fine-tuning. Existing cross-embodiment policies typically rely on shared-private architectures, which suffer from limited capacity of private parameters and lack explicit adaptation mechanisms. To address these limitations, we introduce MOTIF, a method for efficient few-shot cross-embodiment transfer that decouples embodiment-agnostic spatiotemporal patterns, termed action motifs, from heterogeneous action data. Specifically, MOTIF first learns unified motifs via vector quantization with progress-aware alignment and embodiment adversarial constraints to ensure temporal and cross-embodiment consistency. We then design a lightweight predictor that predicts these motifs from real-time inputs to guide a flow-matching policy, fusing them with robot-specific states to enable action generation on new embodiments. Evaluations across both simulation and real-world environments validate the superiority of MOTIF, which significantly outperforms strong baselines in few-shot transfer scenarios by 6.5% in simulation and 43.7% in real-world settings. Code is available at https://github.com/buduz/MOTIF.
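To make the vector-quantization step concrete, the following is a minimal sketch of a nearest-codebook lookup, in which each latent action feature is replaced by the index of its closest codebook entry (a "motif"). It is an illustration under stated assumptions, not the authors' implementation: the function names, the 2-D latents, and the three-entry codebook are all hypothetical, and the progress-aware alignment and embodiment adversarial constraints described above are omitted.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latent: Sequence[float],
             codebook: Sequence[Sequence[float]]) -> Tuple[int, List[float]]:
    """Return the index and vector of the codebook entry nearest to `latent`."""
    index = min(range(len(codebook)),
                key=lambda k: squared_distance(latent, codebook[k]))
    return index, list(codebook[index])


# Hypothetical shared codebook of three motifs (assumed values, for illustration).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Hypothetical latent features from two different robots performing similar motions.
latents_robot_a = [[0.9, 0.1], [0.1, 0.8]]
latents_robot_b = [[1.2, -0.1], [-0.1, 1.1]]

# Each latent is mapped to a discrete motif index from the shared codebook.
motifs_a = [quantize(z, codebook)[0] for z in latents_robot_a]
motifs_b = [quantize(z, codebook)[0] for z in latents_robot_b]
print(motifs_a, motifs_b)
```

In this toy setup the two robots produce different latent values but are assigned the same motif indices, which is the kind of embodiment-agnostic discrete representation the abstract describes. In a trained model the codebook would be learned jointly with an encoder rather than fixed by hand.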
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Cross-Embodiment Transfer | Real-world Cross-Embodiment 5-shot | Push Cube | 70 | 6 |
| Cross-Embodiment Transfer | ManiSkill Simulation 1-Shot | Transfer Success Rate | 36 | 5 |
| Cross-Embodiment Transfer | ManiSkill Simulation 3-Shot | Transfer Success Rate | 48.33 | 5 |
| Cross-Embodiment Transfer | ManiSkill Simulation 5-Shot | Transfer Success Rate | 54.33 | 5 |
| Cross-Embodiment Transfer | ManiSkill Simulation 10-Shot | Transfer Success Rate | 60.33 | 5 |
| Cross-Embodiment Transfer | ManiSkill Simulation 50-Shot | Transfer Success Rate | 75 | 5 |