Maximum Entropy Reinforcement Learning with Diffusion Policy
About
The Soft Actor-Critic (SAC) algorithm with a Gaussian policy has become a mainstream implementation of the Maximum Entropy Reinforcement Learning (MaxEnt RL) objective, which incorporates entropy maximization to encourage exploration and enhance policy robustness. While the Gaussian policy performs well on simpler tasks, its exploration capacity and potential performance in complex multi-goal RL environments are limited by its inherent unimodality. In this paper, we employ the diffusion model, a powerful generative model capable of capturing complex multimodal distributions, as the policy representation to fulfill the MaxEnt RL objective, developing a method named MaxEnt RL with Diffusion Policy (MaxEntDP). Our method enables efficient exploration and brings the policy closer to the optimal MaxEnt policy. Experimental results on MuJoCo benchmarks show that MaxEntDP outperforms the Gaussian policy and other generative models within the MaxEnt RL framework, and performs comparably to other state-of-the-art diffusion-based online RL algorithms. Our code is available at https://github.com/diffusionyes/MaxEntDP.
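To make the unimodality limitation concrete, the sketch below is a toy illustration and not the MaxEntDP implementation. It relies on the standard MaxEnt RL result that the optimal policy is a Boltzmann distribution over the soft Q-function, pi*(a|s) proportional to exp(Q(s, a) / alpha). The 1-D action grid, the two-goal `toy_q` function, and the temperature `alpha = 0.1` are all hypothetical choices made for illustration. The code compares that target with its moment-matched Gaussian, which is the best a unimodal policy can do in terms of mean and variance.

```python
import math

def boltzmann_policy(q_values, alpha):
    """Discretized MaxEnt-optimal policy: pi(a) proportional to exp(Q(a) / alpha)."""
    q_max = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - q_max) / alpha) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def toy_q(a):
    """Hypothetical two-goal Q-function with equal peaks at a = -1 and a = +1."""
    return max(-(a + 1.0) ** 2, -(a - 1.0) ** 2)

# Hypothetical 1-D action grid on [-2, 2]; index 200 is the action a = 0.
actions = [-2.0 + 4.0 * i / 400 for i in range(401)]
alpha = 0.1  # assumed entropy temperature

# Multimodal target: the optimal MaxEnt policy for the toy Q-function.
pi_star = boltzmann_policy([toy_q(a) for a in actions], alpha)

# Unimodal approximation: a Gaussian with the same mean and variance,
# renormalized on the same grid.
mean = sum(p * a for p, a in zip(pi_star, actions))
var = sum(p * (a - mean) ** 2 for p, a in zip(pi_star, actions))
gauss_weights = [math.exp(-((a - mean) ** 2) / (2.0 * var)) for a in actions]
gauss_total = sum(gauss_weights)
gauss = [w / gauss_total for w in gauss_weights]

# Probability each policy assigns to the low-value action a = 0,
# relative to that policy's own most likely action.
print("target   pi(0) / max:", pi_star[200] / max(pi_star))
print("gaussian pi(0) / max:", gauss[200] / max(gauss))
```

In this toy setting the Gaussian places its peak on the action between the two goals, exactly where the target policy assigns almost no probability. A policy class that can represent both modes, such as a diffusion model, avoids that mismatch.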
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Continuous Control | MuJoCo Ant v4 | Average Return | 5.72e+3 | 46 |
| Continuous Control | MuJoCo Walker2d v4 | -- | -- | 39 |
| Continuous Control | MuJoCo HalfCheetah v4 | Average Return | 1.13e+4 | 36 |
| Continuous Control | MuJoCo Swimmer v4 | Total Reward | 90.3 | 19 |
| Continuous Control | Walker2D v5 | Avg Return | 3.63e+3 | 17 |
| Continuous Control | Ant v4 | Average Return | 5.72e+3 | 15 |
| Continuous Control | Hopper v5 | Average Return | 3.00e+3 | 15 |
| Continuous Control | Humanoid v5 | Average Return | 3.08e+3 | 13 |
| Continuous Control | InvertedPendulum v5 | Average Episodic Reward | 1.00e+3 | 8 |
| Continuous Control | Reacher v5 | Average Episodic Reward | -4.4 | 8 |