
Diffusion Policy Policy Optimization

About

We introduce Diffusion Policy Policy Optimization (DPPO), an algorithmic framework including best practices for fine-tuning diffusion-based policies (e.g. Diffusion Policy) in continuous control and robot learning tasks using the policy gradient (PG) method from reinforcement learning (RL). PG methods are ubiquitous in training RL policies with other policy parameterizations; nevertheless, they had been conjectured to be less efficient for diffusion-based policies. Surprisingly, we show that DPPO achieves the strongest overall performance and efficiency for fine-tuning in common benchmarks compared to other RL methods for diffusion-based policies and also compared to PG fine-tuning of other policy parameterizations. Through experimental investigation, we find that DPPO takes advantage of unique synergies between RL fine-tuning and the diffusion parameterization, leading to structured and on-manifold exploration, stable training, and strong policy robustness. We further demonstrate the strengths of DPPO in a range of realistic settings, including simulated robotic tasks with pixel observations, and via zero-shot deployment of simulation-trained policies on robot hardware in a long-horizon, multi-stage manipulation task. Website with code: diffusion-ppo.github.io

Allen Z. Ren, Justin Lidard, Lars L. Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, Max Simchowitz • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Dexterous Manipulation | Dexterous Manipulation Simulation (test) | Grasping | 41 | 12 |
| Can | RoboMimic MH 100 trajectories Simplified (multi-human) | Success Rate | 100 | 5 |
| Lift | RoboMimic MH 100 trajectories Simplified (multi-human) | Success Rate | 100 | 5 |
| Online Reinforcement Learning | DMControl FingerSpin (final) | Normalized Return | 694.1 | 5 |
| Online Reinforcement Learning | WalkerWalk DMControl (final) | Normalized Return | 345.6 | 5 |
| Online Reinforcement Learning | CheetahRun DMControl (final) | Normalized Return | 559.8 | 5 |
| Online Reinforcement Learning | FingerTurnHard DMControl (final) | Normalized Return | 633.8 | 5 |
| Online Reinforcement Learning | FishSwim DMControl (final) | Normalized Return | 143.5 | 5 |
| Online Reinforcement Learning | HopperStand DMControl (final) | Normalized Return | 2.14 | 5 |
| Square | RoboMimic MH 100 trajectories Simplified (multi-human) | Success Rate | 78 | 4 |
Showing 10 of 11 rows
