
Distributional Soft Actor-Critic: Off-Policy Reinforcement Learning for Addressing Value Estimation Errors

About

In reinforcement learning (RL), function approximation errors are known to easily lead to Q-value overestimations, greatly reducing policy performance. This paper presents a distributional soft actor-critic (DSAC) algorithm, an off-policy RL method for continuous control settings, which improves policy performance by mitigating Q-value overestimations. We first show in theory that learning a distribution function of state-action returns can effectively mitigate Q-value overestimations, because it is capable of adaptively adjusting the update stepsize of the Q-value function. Then, a distributional soft policy iteration (DSPI) framework is developed by embedding the return distribution function into maximum entropy RL. Finally, we present a deep off-policy actor-critic variant of DSPI, called DSAC, which directly learns a continuous return distribution while keeping the variance of the state-action returns within a reasonable range to address exploding and vanishing gradient problems. We evaluate DSAC on the suite of MuJoCo continuous control tasks, achieving state-of-the-art performance.

Jingliang Duan, Yang Guan, Shengbo Eben Li, Yangang Ren, Bo Cheng • 2020

Related benchmarks

Task                    Dataset                    Metric                Result    Rank
Reinforcement Learning  HalfCheetah v3             Mean Reward           1.70e+4   15
Reinforcement Learning  Swimmer v3                 Mean Reward           138       15
Reinforcement Learning  Humanoid v3                Avg Final Return      1.08e+4   7
Reinforcement Learning  Ant v3                     Average Final Return  7.09e+3   7
Reinforcement Learning  Walker2d v3                Average Final Return  6.42e+3   7
Reinforcement Learning  InvertedDoublePendulum v3  Average Final Return  9.36e+3   7
Reinforcement Learning  Hopper v3                  Average Final Return  3.66e+3   7
Reinforcement Learning  Pusher v2                  Average Final Return  -19       7
Continuous Control      Halfcheetah v5             Average Return        1.30e+4   7
Continuous Control      Hopper v5                  Average Return        3.52e+3   7
Showing 10 of 13 rows
