PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation
About
Recent advances in vision-language-action (VLA) models have shown promise in robotic manipulation, yet they continue to struggle with long-horizon, multi-step tasks. Existing methods lack internal reasoning mechanisms that can identify task-relevant interaction cues or track progress within a subtask, leading to critical execution errors such as repeated actions, missed steps, and premature termination. To address these challenges, we introduce PALM, a VLA framework that structures policy learning around interaction-centric affordance reasoning and subtask progress cues. PALM distills complementary affordance representations that capture object relevance, contact geometry, spatial placements, and motion dynamics, and that serve as task-relevant anchors for visuomotor control. To further stabilize long-horizon execution, PALM predicts continuous within-subtask progress, enabling seamless subtask transitions. Across extensive simulation and real-world experiments, PALM consistently outperforms baselines, achieving a 91.8% success rate on LIBERO-LONG, a 12.5% improvement in average length on CALVIN ABC→D, and a 2× improvement over real-world baselines across three long-horizon generalization settings.
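The paper does not release the progress head or transition logic in this snippet, but the idea of gating subtask transitions on a predicted continuous progress value can be sketched as follows. All names (`advance_subtask`, the 0.95 threshold, the rollout values) are illustrative assumptions, not PALM's actual API:

```python
# Hypothetical sketch: gating subtask transitions on a predicted
# within-subtask progress signal (names and threshold are illustrative).

def advance_subtask(subtask_idx, progress, num_subtasks, threshold=0.95):
    """Advance to the next subtask once predicted progress nears 1.0.

    progress: model-predicted completion in [0, 1] for the current subtask.
    Returns the (possibly updated) subtask index.
    """
    if progress >= threshold and subtask_idx < num_subtasks - 1:
        return subtask_idx + 1
    return subtask_idx

# Simulated rollout: progress rises within each subtask, then resets
# after a transition. The gate avoids premature termination and
# repeated steps by only advancing when progress is near complete.
idx = 0
for p in [0.2, 0.6, 0.96, 0.3, 0.97]:
    idx = advance_subtask(idx, p, num_subtasks=3)
print(idx)  # 2
```

A continuous progress signal (rather than a binary "done" flag) lets the policy condition on how far along the current subtask is, which is what the paper credits for smoother transitions.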
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Robot Manipulation | LIBERO (test) | Average Success Rate | 94.5 | 142 |
| Instruction-following robotic manipulation | CALVIN ABC→D (unseen environment D) | Success Rate (Length 1) | 96.9 | 29 |
| Long-horizon robotic manipulation | Real-world (Random Localization) | Success Rate (Step 1) | 70 | 3 |
| Long-horizon robotic manipulation | Real-world (Visual Distraction) | Success Rate (Step 1) | 85 | 3 |
| Long-horizon robotic manipulation | Real-world (Unseen Lighting) | Success Rate (Step 1) | 80 | 3 |