Maximum Entropy Reinforcement Learning with Diffusion Policy

About

The Soft Actor-Critic (SAC) algorithm with a Gaussian policy has become a mainstream implementation for realizing the Maximum Entropy Reinforcement Learning (MaxEnt RL) objective, which incorporates entropy maximization to encourage exploration and enhance policy robustness. While the Gaussian policy performs well on simpler tasks, its exploration capacity and potential performance in complex multi-goal RL environments are limited by its inherent unimodality. In this paper, we employ the diffusion model, a powerful generative model capable of capturing complex multimodal distributions, as the policy representation to fulfill the MaxEnt RL objective, developing a method named MaxEnt RL with Diffusion Policy (MaxEntDP). Our method enables efficient exploration and brings the policy closer to the optimal MaxEnt policy. Experimental results on MuJoCo benchmarks show that MaxEntDP outperforms the Gaussian policy and other generative models within the MaxEnt RL framework, and performs comparably to other state-of-the-art diffusion-based online RL algorithms. Our code is available at https://github.com/diffusionyes/MaxEntDP.
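The abstract's central contrast, a unimodal Gaussian policy versus a diffusion policy that can place mass on several distinct actions, can be illustrated with a toy sketch. Nothing below is from the paper: the variance schedule, the `toy_denoiser` (a hypothetical stand-in for a learned noise-prediction network), and the bimodal target at ±state are all illustrative assumptions. It shows only the generic mechanism: an action is drawn by running a DDPM-style reverse process from Gaussian noise.

```python
import numpy as np

def make_schedule(T=20, beta_min=1e-4, beta_max=0.2):
    """Linear variance schedule, as in a standard DDPM (illustrative values)."""
    betas = np.linspace(beta_min, beta_max, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def toy_denoiser(a_t, t, state, alpha_bars):
    """Hypothetical stand-in for a learned noise predictor eps_theta(a_t, t, s).

    It pretends the optimal policy is bimodal, with modes at +state and
    -state, and returns the noise implied by the nearer mode being the
    clean action. A real method would train a network for this.
    """
    target = np.sign(a_t) * state
    ab = alpha_bars[t]
    return (a_t - np.sqrt(ab) * target) / np.sqrt(1.0 - ab)

def sample_action(state, T=20, rng=None):
    """Draw one action via the DDPM reverse process: noise -> action."""
    rng = rng if rng is not None else np.random.default_rng(0)
    betas, alphas, alpha_bars = make_schedule(T)
    a = rng.standard_normal()  # a_T ~ N(0, 1)
    for t in reversed(range(T)):
        eps = toy_denoiser(a, t, state, alpha_bars)
        # Standard DDPM reverse-mean update toward the predicted clean action.
        a = (a - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:  # no noise injected on the final step
            a += np.sqrt(betas[t]) * rng.standard_normal()
    return a

rng = np.random.default_rng(42)
actions = [sample_action(1.0, rng=rng) for _ in range(100)]
# With this toy denoiser the sampler recovers both modes, near +1 and -1,
# which a single Gaussian policy head could not represent.
```

Because each sample starts from fresh Gaussian noise, the reverse process lands on either mode depending on the noise draw; this is the multimodality argument the abstract makes against a single Gaussian policy.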

Xiaoyi Dong, Jian Cheng, Xi Sheryl Zhang · 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Continuous Control | Walker2D v5 | Average Return | 3.63e+3 | 17 |
| Continuous Control | Hopper v5 | Average Return | 3.00e+3 | 15 |
| Continuous Control | Humanoid v5 | Average Return | 3.08e+3 | 13 |
| Continuous Control | InvertedPendulum v5 | Average Episodic Reward | 1.00e+3 | 8 |
| Continuous Control | Reacher v5 | Average Episodic Reward | -4.4 | 8 |
| Continuous Control | Swimmer v5 | Average Episodic Reward | 75.8 | 8 |
| Continuous Control | Pusher v5 | Final Return | -42.3 | 6 |
| Continuous Control | Ant v5 | Final Return | 2.89e+3 | 6 |
| Continuous Control | Halfcheetah v5 | Final Return | 6.79e+3 | 6 |
| Continuous Control | Inverted2Pendulum v5 | Final Return | 8.58e+3 | 6 |
