ReWiND: Language-Guided Rewards Teach Robot Policies without New Demonstrations
About
We introduce ReWiND, a framework for learning robot manipulation tasks solely from language instructions without per-task demonstrations. Standard reinforcement learning (RL) and imitation learning methods require expert supervision through human-designed reward functions or demonstrations for every new task. In contrast, ReWiND starts from a small demonstration dataset to learn: (1) a data-efficient, language-conditioned reward function that labels the dataset with rewards, and (2) a language-conditioned policy pre-trained with offline RL using these rewards. Given an unseen task variation, ReWiND fine-tunes the pre-trained policy using the learned reward function, requiring minimal online interaction. We show that ReWiND's reward model generalizes effectively to unseen tasks, outperforming baselines by up to 2.4x in reward generalization and policy alignment metrics. Finally, we demonstrate that ReWiND enables sample-efficient adaptation to new tasks, beating baselines by 2x in simulation and improving real-world pretrained bimanual policies by 5x, taking a step towards scalable, real-world robot learning. See website at https://rewind-reward.github.io/.
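The reward-labeling step in (1) can be pictured with a minimal sketch. Everything in it is hypothetical: the `Trajectory` container, the `toy_reward` function, and its word-overlap scoring are illustrative stand-ins, not ReWiND's reward model or API. It only shows the shape of the idea, assuming a reward function that scores the frames seen so far against the language instruction.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class Trajectory:
    """Hypothetical container: one demonstration plus its instruction."""
    instruction: str
    observations: List[str]  # text stand-ins for video frames
    rewards: List[float] = field(default_factory=list)


RewardFn = Callable[[Sequence[str], str], float]


def toy_reward(frames: Sequence[str], instruction: str) -> float:
    """Toy stand-in for a learned reward model (an assumption, not ReWiND's).

    Returns the fraction of instruction words that appear in the frames
    observed so far.
    """
    words = set(instruction.lower().split())
    seen = set()
    for frame in frames:
        seen.update(frame.lower().split())
    return len(words & seen) / len(words) if words else 0.0


def label_dataset(dataset: List[Trajectory], reward_fn: RewardFn) -> List[Trajectory]:
    """Attach a per-step reward to every trajectory, in place."""
    for traj in dataset:
        traj.rewards = [
            reward_fn(traj.observations[: t + 1], traj.instruction)
            for t in range(len(traj.observations))
        ]
    return dataset


demo = Trajectory(
    instruction="open the box",
    observations=["gripper near box", "open lid", "the box is open"],
)
label_dataset([demo], toy_reward)
```

In this sketch the labeled trajectories would then feed an offline RL learner, and the same reward function would score online rollouts on a new instruction. Both of those stages are omitted here.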
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Box Open | Real-world Franka Emika | Success Rate | 1 | 9 |
| Peg-Insert | Real-world Franka Emika | Success Rate | 100 | 9 |
| Bulb-Unscrew | Real-world Franka Emika | Success Rate | 0 | 9 |
| Reward alignment | RBM-EVAL ID | Pearson r (VOC) | 0.46 | 8 |
| Reward alignment | RBM-EVAL OOD | Pearson r (VOC) | 0.51 | 8 |
| Trajectory Ranking | RBM OOD 1.0 (test) | Kendall's Tau-a | 0.01 | 8 |
| Reward Modeling | D_dish real policy rollouts (test) | Rollout ρ | 0.55 | 6 |
| Reward Modeling | D_dish (val) | Demo Loss | 0.018 | 6 |
| Robotic Manipulation | Real-world Box-Open Position OOD v1 | Success Rate | 50 | 6 |
| Robotic Manipulation | Real-world Box-Open Object OOD v1 | Success Rate | 0 | 6 |
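
For readers unfamiliar with the correlation metrics in the table, the sketch below computes Pearson r and Kendall's Tau-a from their standard textbook definitions. It assumes the benchmarks use those definitions. How the predicted rewards and reference orderings are built for RBM-EVAL is not stated here, so the example inputs are made up for illustration.

```python
import math
from itertools import combinations
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("Pearson r is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's Tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs contribute zero to the numerator but still count in the
    denominator, which is what distinguishes Tau-a from Tau-b.
    """
    n = len(x)
    score = 0
    for i, j in combinations(range(n), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            score += 1
        elif product < 0:
            score -= 1
    return score / (n * (n - 1) / 2)


# Made-up example: predicted rewards versus frame order in one rollout.
predicted = [0.1, 0.4, 0.3, 0.9]
frame_order = [0, 1, 2, 3]
r = pearson_r(predicted, frame_order)
tau = kendall_tau_a(predicted, frame_order)
```

Both functions return values in [-1, 1]. A value near 1 means the predicted rewards rise in step with the reference, and a value near 0 means there is little monotonic agreement.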