Beyond Pixels: Learning Invariant Rewards for Real-World Robotics From a Few Demonstrations

About

Designing reward functions that generalize beyond controlled laboratory settings remains a fundamental challenge in reinforcement learning for robotics. In open-world manipulation, a single task can appear in many variants that differ in object instance, position, and camera viewpoint. Recent vision-based reward models tend to memorize specific pixel distributions and fail to generalize beyond their training conditions. To address this, we propose a framework that learns invariant symbolic reward functions from as few as five demonstrations. The key insight is to shift from fitting visual features to discovering behavioral invariants: task-level properties that remain constant across diverse visual instantiations. The framework has two coupled components: a structural reward formulation that encodes task-level strategies and physical constraints while preserving optimal policy invariance, and a hybrid symbolic-numerical procedure that distills these invariants from demonstrations without online interaction. Experiments on eight Meta-World tasks and three Franka manipulation tasks show that our method achieves stronger process alignment and more accurate ranking of policy rollouts than baselines, and that it accelerates downstream policy learning. Three real-world out-of-distribution experiments further show that the same learned reward generalizes zero-shot to position, viewpoint, and object variations, so a single reward representation can be reused across diverse task variants in practice.
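
The abstract states that the structural reward preserves optimal policy invariance, but this page gives no formula. One standard way to obtain that guarantee is potential-based reward shaping (Ng, Harada and Russell, 1999): a potential phi over states contributes F(s, s') = gamma * phi(s') - phi(s) on top of the base reward. The Python sketch below illustrates that general technique only; whether the authors' formulation takes this form is an assumption, and the `potential` function, the `HANDLE` position and gamma = 0.99 are hypothetical placeholders, not values from the paper.

```python
import math
from typing import Callable, Sequence

Point = tuple[float, float, float]

# Hypothetical target location (e.g. a box-lid handle); not taken from the paper.
HANDLE: Point = (0.0, 0.0, 0.0)


def potential(gripper: Point) -> float:
    """Hypothetical task-level potential: higher when the gripper is nearer the handle.
    It depends only on relative geometry, not on pixels or camera pose."""
    return -math.dist(gripper, HANDLE)


def shaping_term(phi: Callable[[Point], float], s: Point, s_next: Point, gamma: float) -> float:
    """Potential-based shaping term F(s, s') = gamma * phi(s') - phi(s)."""
    return gamma * phi(s_next) - phi(s)


def shaped_reward(base: float, s: Point, s_next: Point, gamma: float = 0.99) -> float:
    """Base reward (e.g. a sparse success signal) plus the shaping term."""
    return base + shaping_term(potential, s, s_next, gamma)


def discounted_shaping_return(states: Sequence[Point], gamma: float = 0.99) -> float:
    """Discounted sum of F along a trajectory. It telescopes to
    gamma**T * phi(s_T) - phi(s_0); with phi = 0 at terminal states the
    offset is -phi(s_0), which does not depend on the actions taken,
    so the ordering of policies by return is unchanged."""
    return sum(
        gamma**t * shaping_term(potential, states[t], states[t + 1], gamma)
        for t in range(len(states) - 1)
    )
```

For a gripper moving along [(0.3, 0, 0), (0.2, 0, 0), (0.1, 0, 0), (0, 0, 0)], `discounted_shaping_return` comes out to about 0.3, i.e. -phi(s_0), because the trajectory ends on the handle where the potential is zero. The geometric potential is chosen here to mirror the abstract's idea of a task-level property that stays the same under viewpoint or texture changes.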

Tengye Xu, Yangting Sun, Ziju Shen, Guanqi Chen, Zhen Fu, Chen Yizhou, Hua Chen, Jia Pan • 2026

Related benchmarks

Task	Dataset	Metric	Result
Box-Open	Real-world Franka Emika	Success Rate	1	9
Bulb-Unscrew	Real-world Franka Emika	Success Rate	9	9
Peg-Insert	Real-world Franka Emika	Success Rate	100	9
Robotic Manipulation	Real-world Box-Open Position OOD v1	Success Rate	100	6
Robotic Manipulation	Real-world Box-Open Object OOD v1	Success Rate	90	6
Reward Model Evaluation	Meta-World (train)	Process Alignment Correlation (ρ)	0.97	5
Reward Model Evaluation	Meta-World Position OOD	Process Alignment Correlation (ρ)	0.85	5
Reward Model Evaluation	Meta-World Viewpoint OOD	Process Alignment Correlation (ρ)	0.88	5
Reward Model Evaluation	Meta-World Object OOD	Process Alignment Correlation (ρ)	0.81	5
Robotic Manipulation	Real-world Box-Open Viewpoint OOD v1	Success Rate	15	3
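
The table reports process alignment as a correlation ρ without defining it. A common convention in reward-model evaluation, assumed here rather than taken from the paper, is Spearman's rank correlation between the learned per-step reward along a demonstration and the time-step index: ρ = 1 means the reward rises monotonically as the task progresses, and ρ = -1 means it consistently falls. The sketch below implements Spearman's ρ (Pearson correlation of average ranks) in plain Python; `process_alignment` is a hypothetical helper name.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one sequence is constant
    return cov / (vx * vy) ** 0.5


def process_alignment(step_rewards: Sequence[float]) -> float:
    """Hypothetical metric: rank correlation between the per-step learned
    reward along one demonstration and the time-step index."""
    return spearman_rho(list(range(len(step_rewards))), step_rewards)
```

For example, `process_alignment([0.1, 0.2, 0.15, 0.4, 0.9])` gives 0.9, since only one pair of adjacent steps is out of order. A per-task score would then typically be averaged over held-out demonstrations; `scipy.stats.spearmanr` computes the same statistic, also using average ranks for ties.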
