Softmax Deep Double Deterministic Policy Gradients

About

Deep Deterministic Policy Gradients (DDPG), a widely used actor-critic reinforcement learning algorithm for continuous control, suffers from the overestimation problem, which can degrade performance. Although the state-of-the-art Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm mitigates the overestimation issue, it can lead to a large underestimation bias. In this paper, we propose to use the Boltzmann softmax operator for value function estimation in continuous control. We first theoretically analyze the softmax operator in continuous action space. Then, we uncover an important property of the softmax operator in actor-critic algorithms, i.e., it helps to smooth the optimization landscape, which sheds new light on the benefits of the operator. We also design two new algorithms, Softmax Deep Deterministic Policy Gradients (SD2) and Softmax Deep Double Deterministic Policy Gradients (SD3), by building the softmax operator upon single and double estimators, which can effectively reduce the overestimation and underestimation bias. We conduct extensive experiments on challenging continuous control tasks, and the results show that SD3 outperforms state-of-the-art methods.
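As a rough illustration of the Boltzmann softmax operator mentioned in the abstract, the sketch below computes a softmax-weighted average over a finite set of Q-values. It is a simplified stand-in, not the estimator used in SD2 or SD3: the function name `boltzmann_softmax`, the inverse temperature `beta`, and the sample values are all assumptions made for this example, and the continuous action space is replaced by a handful of hypothetical sampled actions.

```python
import numpy as np


def boltzmann_softmax(q_values, beta):
    """Softmax-weighted average of q_values with inverse temperature beta.

    Hypothetical helper for illustration only; it works on a finite set of
    Q-values rather than a continuous action space.
    """
    q = np.asarray(q_values, dtype=float)
    # Subtract the maximum before exponentiating for numerical stability;
    # the shift cancels out when the weights are normalized.
    weights = np.exp(beta * (q - q.max()))
    weights /= weights.sum()
    return float(np.dot(weights, q))


# Hypothetical Q-values for three sampled actions in one state.
q_samples = [1.0, 2.0, 4.0]

for beta in (0.0, 1.0, 100.0):
    print(beta, boltzmann_softmax(q_samples, beta))
```

With `beta = 0` the operator reduces to the plain mean of the Q-values, and as `beta` grows it approaches the maximum, so intermediate values of `beta` interpolate between the two.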

Ling Pan, Qingpeng Cai, Longbo Huang • 2020

Related benchmarks

Task	Dataset	Metric	Result
Continuous Control	Ant v4	Average Return	2.96e+3	15
Continuous Control	HalfCheetah v4	Max Average Return	7.16e+3	12
Continuous Control	Hopper v4	Maximum Average Return	3.52e+3	5
Continuous Control	Walker2d v4	Average Return	3.40e+3	5
Robot Control	PandaReach v2	Max Average Return	-48.36	5
Continuous Control	Walker2d v4	Number of Interactions (10^4 steps)	65.5	5
Continuous Control	Ant v4	Interaction Count (10^4 steps)	28	5
Robot Control	QuadX-Waypoints v1	Max Average Return	383	5
