Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons

About

General-purpose robot reward models are typically trained to predict absolute task progress from expert demonstrations, providing only local, frame-level supervision. While effective for expert demonstrations, this paradigm scales poorly to large-scale robotics datasets where failed and suboptimal trajectories are abundant and assigning dense progress labels is ambiguous. We introduce Robometer, a scalable reward modeling framework that combines intra-trajectory progress supervision with inter-trajectory preference supervision. Robometer is trained with a dual objective: a frame-level progress loss that anchors reward magnitude on expert data, and a trajectory-comparison preference loss that imposes global ordering constraints across trajectories of the same task, enabling effective learning from both real and augmented failed trajectories. To support this formulation at scale, we curate RBM-1M, a reward-learning dataset comprising over one million trajectories spanning diverse robot embodiments and tasks, including substantial suboptimal and failure data. Across benchmarks and real-world evaluations, Robometer learns more generalizable reward functions than prior methods and improves robot learning performance across a diverse set of downstream applications. Code, model weights, and videos are available at https://robometer.github.io/.

Anthony Liang, Yigit Korkmaz, Jiahui Zhang, Minyoung Hwang, Abrar Anwar, Sidhant Kaushik, Aditya Shah, Alex S. Huang, Luke Zettlemoyer, Dieter Fox, Yu Xiang, Anqi Li, Andreea Bobu, Abhishek Gupta, Stephen Tu, Erdem Biyik, Jesse Zhang • 2026
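
The dual objective described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the choice of mean squared error for the progress term, a Bradley-Terry style logistic loss for the preference term, scalar trajectory scores, and the `pref_weight` parameter are all assumptions made here for illustration, and every function name is hypothetical.

```python
import math


def progress_loss(predicted, target):
    """Frame-level term: mean squared error between predicted and target progress.

    Assumption: MSE is used here only as a stand-in for the paper's progress loss.
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same number of frames")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def preference_loss(score_preferred, score_rejected):
    """Trajectory-comparison term: -log(sigmoid(score_preferred - score_rejected)).

    Assumption: a Bradley-Terry style loss over one scalar score per trajectory.
    """
    margin = score_preferred - score_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def dual_objective(expert_pred, expert_target, score_preferred, score_rejected,
                   pref_weight=1.0):
    """Sum of the two terms; pref_weight is a hypothetical balancing coefficient."""
    return (progress_loss(expert_pred, expert_target)
            + pref_weight * preference_loss(score_preferred, score_rejected))


# Expert frames with linearly increasing target progress (illustrative numbers).
expert_pred = [0.1, 0.4, 0.9]
expert_target = [0.0, 0.5, 1.0]

# Scalar scores for a better and a worse trajectory of the same task.
loss = dual_objective(expert_pred, expert_target,
                      score_preferred=0.9, score_rejected=0.2)
print(f"dual objective: {loss:.4f}")
```

In this sketch the progress term needs labels only on expert frames, while the preference term needs only a relative ordering between two trajectories, which is why failed or suboptimal trajectories can contribute without dense progress labels.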

Related benchmarks

Task	Dataset	Result
Robotic Manipulation	ManiSkill3	Average Success Rate 56.56	33
Reward alignment	RBM-EVAL ID	Pearson r (VOC) 0.92	14
Reward alignment	RBM-EVAL OOD	Pearson r (VOC) 0.95	14
Reward Modeling	RBM ID (Refined)	VOC r Score 81	12
Trajectory Ranking	RBM OOD 1.0 (test)	Kendall's Tau-a 0.66	8
Reward Prediction	RoboRewardBench	MAE 0.72	8
Reward rollout alignment	10-task benchmark T1: Folding Shorts	Rollout ρ 0.667	8
Reward Prediction	10-task benchmark S1 classic	Demo L (MSE) 0.032	8
Robotic Task Perception	RoboFAC real-robot	VOC Success Rate 41.29	8
Progress Estimation	ProcVQA-OOD progress estimation	VOC Score 52.96	8

Showing 10 of 31 rows
