Diffusion Policy through Conditional Proximal Policy Optimization

About

Reinforcement learning (RL) has been extensively employed in a wide range of decision-making problems, such as games and robotics. Recently, diffusion policies have shown strong potential for modeling multimodal behaviors, enabling more diverse and flexible action generation than the conventional Gaussian policy. Despite various attempts to combine RL with diffusion, a key challenge remains: the action log-likelihood under a diffusion model is difficult to compute, which greatly hinders the direct application of diffusion policies in on-policy reinforcement learning. Most existing methods calculate or approximate the log-likelihood through the entire denoising process of the diffusion model, which can be inefficient in both memory and computation. To overcome this challenge, we propose a novel and efficient method for training a diffusion policy in an on-policy setting that requires only evaluating a simple Gaussian probability. This is achieved by aligning policy iteration with the diffusion process, a distinct paradigm compared to previous work. Moreover, our formulation naturally handles entropy regularization, which is often difficult to incorporate into diffusion policies. Experiments demonstrate that the proposed method produces multimodal policy behaviors and achieves superior performance on a variety of benchmark tasks in both IsaacLab and MuJoCo Playground.
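To make the "simple Gaussian probability" point concrete: each reverse-diffusion step a_{k-1} ~ N(mu_theta(s, a_k, k), sigma_k^2 I) is itself Gaussian, so a PPO-style likelihood ratio can be evaluated per denoising step without computing the log-likelihood of the whole chain. The sketch below is a minimal, hypothetical PyTorch illustration of that idea, not the authors' implementation; DenoiseNet, step_log_prob, ppo_step_loss, and the fixed sigma_k are placeholder names and assumptions of ours.

```python
# Minimal sketch (not the paper's code): clipped PPO surrogate on a single
# denoising step of a diffusion policy, using only a Gaussian log-likelihood.
import torch
import torch.nn as nn

class DenoiseNet(nn.Module):
    """Predicts the mean of one reverse-diffusion step from (state, noisy action, step)."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, noisy_action, k):
        k_feat = k.float().unsqueeze(-1)  # denoising-step index as a scalar feature
        return self.net(torch.cat([state, noisy_action, k_feat], dim=-1))

def step_log_prob(net, state, a_k, a_km1, k, sigma_k):
    """Gaussian log-likelihood of one denoising step a_k -> a_{k-1}."""
    mu = net(state, a_k, k)
    return torch.distributions.Normal(mu, sigma_k).log_prob(a_km1).sum(-1)

def ppo_step_loss(net, old_net, batch, sigma_k, clip=0.2):
    """Clipped PPO surrogate where the ratio uses only the per-step Gaussian."""
    state, a_k, a_km1, k, adv = batch
    logp_new = step_log_prob(net, state, a_k, a_km1, k, sigma_k)
    with torch.no_grad():
        logp_old = step_log_prob(old_net, state, a_k, a_km1, k, sigma_k)
    ratio = torch.exp(logp_new - logp_old)
    surr = torch.minimum(ratio * adv,
                         torch.clamp(ratio, 1 - clip, 1 + clip) * adv)
    return -surr.mean()

if __name__ == "__main__":
    # Toy shapes only, random data; not a training loop.
    S, A, B = 4, 2, 8
    net, old_net = DenoiseNet(S, A), DenoiseNet(S, A)
    old_net.load_state_dict(net.state_dict())
    batch = (torch.randn(B, S), torch.randn(B, A), torch.randn(B, A),
             torch.randint(0, 10, (B,)), torch.randn(B))
    print(ppo_step_loss(net, old_net, batch, sigma_k=0.1))
```

Because each step's conditional is Gaussian, the ratio and the per-step entropy are both closed-form, which is what makes entropy regularization straightforward in this setting.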

Ben Liu, Shunpeng Yang, Hua Chen • 2026

Related benchmarks

Task           Benchmark    Result                  Rank
FingerSpin     Playground   Average Reward: 912.5   4
PointMass      Playground   Average Reward: 860.3   4
Reach-E        Playground   Average Reward: 918.9   4
Reach-H        Playground   Average Reward: 873.4   4
Robot Control  IsaacLab     Ant Score: 248.5        4
WlkRun         Playground   Average Reward: 660.5   4
WlkWalk        Playground   Average Reward: 894     4
BallCup        Playground   Average Reward: 866.2   4
Cheetah        Playground   Average Reward: 848.1   4
