Beyond Pixels: Learning Invariant Rewards for Real-World Robotics From a Few Demonstrations
About
Designing reward functions that generalize beyond controlled laboratory settings remains a fundamental challenge in reinforcement learning for robotics. In open-world manipulation problems, a single task can appear in numerous variants through different object instances, positions, and camera viewpoints. Recent vision-based reward models tend to memorize specific pixel distributions and fail to generalize beyond their training conditions. To address this, we propose a framework that learns invariant symbolic reward functions from as few as five demonstrations. The key insight is to shift from visual feature-fitting to the discovery of behavioral invariants: task-level properties that remain constant across diverse visual instantiations. The framework has two coupled components: a structural reward formulation that encodes task-level strategies and physical constraints while preserving optimal policy invariance, and a hybrid symbolic-numerical procedure that distills these invariants from demonstrations without online interaction. Experiments on eight Meta-World tasks and three Franka manipulation tasks demonstrate that our method achieves stronger process alignment and better ranking of policy rollouts than the baselines, and accelerates downstream policy learning. Three real-world out-of-distribution experiments further show that the same learned reward generalizes zero-shot to position, viewpoint, and object variations, enabling a single reward representation to be reused across diverse task variants in practice.
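
The abstract's phrase "preserving optimal policy invariance" can be illustrated with potential-based reward shaping, a standard technique with that property. This is a minimal sketch under that assumption only: the abstract does not state that the paper uses this exact form, and the potential `phi`, the 2-D states, the goal, and the discount value are all hypothetical.

```python
import math

GAMMA = 0.99  # discount factor (assumed value, not from the paper)


def phi(state, goal):
    """Hypothetical potential: negative Euclidean distance to the goal."""
    return -math.dist(state, goal)


def shaped_reward(base_reward, state, next_state, goal, gamma=GAMMA):
    """Base reward plus the shaping term gamma * phi(s') - phi(s)."""
    return base_reward + gamma * phi(next_state, goal) - phi(state, goal)


def discounted_return(rewards, gamma=GAMMA):
    """Discounted sum of a reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


# Toy 2-D trajectory moving toward the goal, with a sparse base reward.
goal = (1.0, 1.0)
states = [(0.0, 0.0), (0.4, 0.3), (0.8, 0.7), (1.0, 1.0)]
base = [0.0, 0.0, 1.0]
shaped = [
    shaped_reward(r, s, s_next, goal)
    for r, s, s_next in zip(base, states, states[1:])
]

# The shaping terms telescope, so the two returns differ by
# gamma^T * phi(s_T) - phi(s_0), which depends only on the first and
# last states of the trajectory.
gap = discounted_return(shaped) - discounted_return(base)
expected_gap = GAMMA ** len(base) * phi(states[-1], goal) - phi(states[0], goal)
print(abs(gap - expected_gap) < 1e-9)
```

Because the gap between shaped and base returns does not depend on the intermediate states, trajectories that share a start and end state keep the same relative ordering, which is the intuition behind policy invariance under this kind of shaping.
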
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Box Open | Real-world Franka Emika | Success Rate | 1 | 9 |
| Bulb-Unscrew | Real-world Franka Emika | Success Rate | 9 | 9 |
| Peg-Insert | Real-world Franka Emika | Success Rate | 100 | 9 |
| Robotic Manipulation | Real-world Box-Open Position OOD v1 | Success Rate | 100 | 6 |
| Robotic Manipulation | Real-world Box-Open Object OOD v1 | Success Rate | 90 | 6 |
| Reward Model Evaluation | Meta-World (train) | Procedural Alignment Correlation (ρ) | 0.97 | 5 |
| Reward Model Evaluation | Meta-World Position OOD | Process Alignment ρ | 0.85 | 5 |
| Reward Model Evaluation | Meta-World Viewpoint OOD | Process Alignment ρ | 0.88 | 5 |
| Reward Model Evaluation | Meta-World Object OOD | Process Alignment Correlation (ρ) | 0.81 | 5 |
| Robotic Manipulation | Real-world Box-Open Viewpoint OOD v1 | Success Rate | 15 | 3 |
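
The ρ values in the table are reported without a definition on this page. One plausible reading, assumed here, is a Spearman rank correlation between the predicted reward along a trajectory and a ground-truth measure of task progress. The sketch below computes that quantity in plain Python; the function names and the toy `progress` and `reward` sequences are hypothetical and are not data from the paper.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy example: progress increases monotonically, the predicted reward
# mostly follows it but swaps the order of two steps.
progress = [0.0, 0.1, 0.3, 0.6, 1.0]
reward = [0.05, 0.2, 0.15, 0.7, 0.9]
print(round(spearman_rho(progress, reward), 3))  # prints 0.9
```

Under this reading, a value near 1 means the reward orders trajectory steps almost exactly as the true progress does, regardless of the reward's scale.
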