
Diffusion Actor-Critic with Entropy Regulator

About

Reinforcement learning (RL) has proven highly effective in addressing complex decision-making and control tasks. However, in most traditional RL algorithms, the policy is parameterized as a diagonal Gaussian distribution with learned mean and variance, which constrains its capability to represent complex policies. In response to this problem, we propose an online RL algorithm termed diffusion actor-critic with entropy regulator (DACER). This algorithm conceptualizes the reverse process of a diffusion model as a novel policy function and leverages the diffusion model's capability to fit multimodal distributions, thereby enhancing the representational capacity of the policy. Since the distribution of the diffusion policy lacks an analytical expression, its entropy cannot be determined analytically. To mitigate this, we propose a method to estimate the entropy of the diffusion policy using a Gaussian mixture model (GMM). Building on the estimated entropy, we learn a parameter $\alpha$ that balances exploration and exploitation; $\alpha$ adaptively regulates the variance of the noise added to the action output by the diffusion model. Experimental trials on MuJoCo benchmarks and a multimodal task demonstrate that the DACER algorithm achieves state-of-the-art (SOTA) performance in most MuJoCo control tasks while exhibiting the stronger representational capacity of the diffusion policy.
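The entropy-regulation idea described above can be sketched in code. The following is a minimal illustrative sketch, not the authors' implementation: it fits a diagonal-covariance GMM to actions sampled from the policy via EM, estimates the mixture's differential entropy by Monte Carlo, and scales exploration noise on the policy's action output by $\alpha$. All hyperparameters (number of components, EM iterations, base noise scale) and function names are assumptions for illustration.

```python
import numpy as np

def component_logpdf(x, means, vars_):
    """Log-density of each diagonal Gaussian component at each point, shape (n, k)."""
    diff = x[:, None, :] - means[None]                       # (n, k, d)
    return -0.5 * ((diff**2 / vars_[None]).sum(-1)
                   + np.log(2 * np.pi * vars_).sum(-1)[None])

def fit_gmm(actions, n_components=2, n_iter=50, seed=0):
    """Fit a diagonal-covariance GMM to sampled actions via EM
    (illustrative stand-in for the GMM used to approximate the
    diffusion policy's action density)."""
    rng = np.random.default_rng(seed)
    n, d = actions.shape
    means = actions[rng.choice(n, n_components, replace=False)]
    vars_ = np.tile(actions.var(axis=0), (n_components, 1))
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: component responsibilities (softmax over log joint)
        logp = component_logpdf(actions, means, vars_) + np.log(weights)
        logp -= logp.max(axis=1, keepdims=True)
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, variances
        nk = resp.sum(axis=0) + 1e-8
        weights = nk / n
        means = (resp.T @ actions) / nk[:, None]
        vars_ = (resp.T @ actions**2) / nk[:, None] - means**2 + 1e-6
    return weights, means, vars_

def gmm_entropy(weights, means, vars_, n_mc=5000, seed=1):
    """Monte Carlo estimate of the mixture's differential entropy,
    H ~= -E_{a ~ GMM}[log p(a)]."""
    rng = np.random.default_rng(seed)
    k = rng.choice(len(weights), size=n_mc, p=weights)
    samples = means[k] + np.sqrt(vars_[k]) * rng.standard_normal((n_mc, means.shape[1]))
    logp = component_logpdf(samples, means, vars_) + np.log(weights)
    m = logp.max(axis=1, keepdims=True)                      # log-sum-exp for stability
    log_density = m[:, 0] + np.log(np.exp(logp - m).sum(axis=1))
    return -log_density.mean()

def regulated_action(action, alpha, base_std=0.1, rng=None):
    """Add exploration noise to the diffusion model's action output,
    with variance modulated by the learned parameter alpha."""
    if rng is None:
        rng = np.random.default_rng()
    return action + alpha * base_std * rng.standard_normal(action.shape)
```

In this sketch the estimated entropy would feed a temperature update in the SAC style: when the estimate falls below a target entropy, $\alpha$ is increased to inject more noise, and vice versa.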

Yinuo Wang, Likun Wang, Yuxuan Jiang, Wenjun Zou, Tong Liu, Xujie Song, Wenxuan Wang, Liming Xiao, Jiang Wu, Jingliang Duan, Shengbo Eben Li · 2024

Related benchmarks

Task | Dataset | Result | Rank
--- | --- | --- | ---
Online Reinforcement Learning | OpenAI Gym MuJoCo Normalized v4 | Normalized Mean Return: 76 | 50
Reinforcement Learning | HalfCheetah v3 | Mean Reward: 1.72e+4 | 15
Reinforcement Learning | Swimmer v3 | Mean Reward: 152 | 15
Reinforcement Learning | Humanoid v3 | Average Final Return: 1.19e+4 | 7
Reinforcement Learning | Ant v3 | Average Final Return: 9.11e+3 | 7
Reinforcement Learning | Walker2d v3 | Average Final Return: 6.70e+3 | 7
Reinforcement Learning | InvertedDoublePendulum v3 | Average Final Return: 9.36e+3 | 7
Reinforcement Learning | Hopper v3 | Average Final Return: 4.10e+3 | 7
Reinforcement Learning | Pusher v2 | Average Final Return: -19 | 7
