SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation

About

Large-scale robot learning has made progress on complex manipulation tasks, yet long-horizon, contact-rich problems, especially those involving deformable objects, remain challenging due to inconsistent demonstration quality. We propose a stage-aware, video-based reward modeling framework that jointly predicts task stage and fine-grained progress, using natural language subtask annotations to derive consistent labels across variable-length demonstrations. This avoids the brittleness of frame-index-based labeling and provides stable supervision even in tasks like T-shirt folding. Our reward model is robust to demonstration variability, generalizes to out-of-distribution scenarios, and improves downstream policy training. Building on it, we introduce Reward-Aligned Behavior Cloning (RA-BC), which filters and reweights demonstrations based on reward estimates. Experiments show that our method significantly outperforms baselines in both real-world rollouts and human validation. On T-shirt folding, we achieve 83% success from the flattened state and 67% from the crumpled state, compared to 8% and 0% with vanilla BC. Overall, our results highlight reward modeling as a scalable and annotation-efficient solution for long-horizon robotic manipulation. Project website: https://qianzhong-chen.github.io/sarm.github.io/
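The abstract's two ideas, stage-based progress labels and reward-based reweighting, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each demonstration comes with subtask boundary frames taken from its language annotations, that every stage is assigned a fixed share of overall progress, and that progress grows linearly within a stage. The function names, the `stage_shares` argument, and the `low`/`high` thresholds are all hypothetical.

```python
from bisect import bisect_right


def stage_progress_labels(boundaries, stage_shares):
    """Label every frame of one demonstration with (stage, progress).

    boundaries: non-decreasing frame indices [b0, ..., bK]; stage k covers
        frames b_k .. b_{k+1} - 1 (assumed to come from subtask annotations).
    stage_shares: K weights summing to 1, the assumed share of total
        progress given to each stage (hypothetical input).
    """
    if len(boundaries) != len(stage_shares) + 1:
        raise ValueError("need exactly one more boundary than stages")
    if any(b > a for b, a in zip(boundaries, boundaries[1:])):
        raise ValueError("boundaries must be non-decreasing")

    # Progress value at the start of each stage.
    starts = [0.0]
    for share in stage_shares:
        starts.append(starts[-1] + share)

    labels = []
    for frame in range(boundaries[0], boundaries[-1]):
        # Index of the stage containing this frame (empty stages are skipped).
        k = bisect_right(boundaries, frame) - 1
        length = boundaries[k + 1] - boundaries[k]
        frac = (frame - boundaries[k]) / length
        labels.append((k, starts[k] + stage_shares[k] * frac))
    return labels


def rabc_weights(progress_gains, low, high):
    """Map reward-estimated progress gains to sample weights in [0, 1].

    Gains at or below `low` get weight 0 (filtered out), gains at or above
    `high` get weight 1, and values in between are scaled linearly.
    The thresholds are illustrative assumptions, not values from the paper.
    """
    if high <= low:
        raise ValueError("high must be greater than low")
    return [min(1.0, max(0.0, (g - low) / (high - low))) for g in progress_gains]


# Two demonstrations of different length with the same two stages.
short_demo = stage_progress_labels([0, 2, 4], [0.5, 0.5])
long_demo = stage_progress_labels([0, 4, 8], [0.5, 0.5])
weights = rabc_weights([-0.1, 0.0, 0.05, 0.2], low=0.0, high=0.1)
```

In this sketch the first frame of the second stage receives the same progress label in both demonstrations, even though it sits at frame 2 in one and frame 4 in the other. Labels derived from the raw frame index alone would not line up across recordings in this way.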

Qianzhong Chen, Justin Yu, Mac Schwager, Pieter Abbeel, Yide Shentu, Philipp Wu • 2025

Related benchmarks

Task	Dataset	Result
Robotic Assembly	AirPods assembly	Grasp Case Success Rate: 100	10
Hard T-shirt folding	T-shirt folding demonstration dataset Hard	Success Rate: 8	8
Medium T-shirt folding	T-shirt folding demonstration dataset Medium	Success Rate: 0.8333	8
Simple T-shirt folding	T-shirt folding demonstration dataset Easy	Success Rate: 12	8
Folding Shorts	Folding Shorts (Crumble)	Success Rate: 0.4167	7
Cleaning Whiteboard	Cleaning Whiteboard	Success Rate (SR): 0.5	7
Folding Shorts	Folding Shorts Flat	Success Rate: 0.5	7
Reward Modeling	D_dish (val)	Demo Loss: 0.013	6
Reward Modeling	D_dish real policy rollouts (test)	Rollout ρ: 0.67	6
Bimanual T-shirt folding	D4 (D1 ∪ DA)	Success Rate: 95	4

Showing 10 of 12 rows
