Controlling Overestimation Bias with Truncated Mixture of Continuous Distributional Quantile Critics
About
Overestimation bias is one of the major impediments to accurate off-policy learning. This paper investigates a novel way to alleviate overestimation bias in a continuous control setting. Our method, Truncated Quantile Critics (TQC), blends three ideas: a distributional representation of the critic, truncation of the critics' predictions, and ensembling of multiple critics. The distributional representation and truncation allow for arbitrarily granular overestimation control, while ensembling provides additional score improvements. TQC outperforms the current state of the art on all environments from the continuous control benchmark suite, demonstrating a 25% improvement on the most challenging Humanoid environment.
Arsenii Kuznetsov, Pavel Shvechikov, Alexander Grishin, Dmitry Vetrov • 2020
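The abstract names three ingredients: distributional critics, truncation, and ensembling. The following is a minimal NumPy sketch of how they might combine into a single learning target, not the authors' implementation. It assumes each critic outputs a fixed number of quantile estimates ("atoms") of the return, that the atoms of all critics are pooled into one mixture, and that the largest atoms are discarded before the Bellman backup. The function name `truncated_mixture_target`, the parameter `drop_per_critic`, and the toy numbers are hypothetical, and details such as an entropy term or the quantile regression loss are omitted.

```python
import numpy as np


def truncated_mixture_target(atoms, reward, gamma=0.99, drop_per_critic=2):
    """Build a truncated target from an ensemble of quantile critics.

    atoms: array of shape (n_critics, n_quantiles) holding each critic's
           quantile estimates of the return at the next state-action pair.
    drop_per_critic: hypothetical knob; how many of the largest atoms to
           discard per critic (applied to the pooled mixture).
    """
    n_critics, n_quantiles = atoms.shape
    if not 0 <= drop_per_critic < n_quantiles:
        raise ValueError("drop_per_critic must be in [0, n_quantiles)")

    # Ensembling: pool the atoms of all critics into one mixture, sorted ascending.
    pooled = np.sort(atoms.reshape(-1))

    # Truncation: keep only the smallest atoms, dropping the right tail
    # where overestimated values are most likely to sit.
    n_keep = n_critics * (n_quantiles - drop_per_critic)
    kept = pooled[:n_keep]

    # Distributional Bellman backup applied atom by atom.
    return reward + gamma * kept


# Toy example: two critics with four atoms each; each has one inflated atom.
atoms = np.array([[1.0, 2.0, 3.0, 10.0],
                  [0.5, 2.5, 3.5, 12.0]])
target = truncated_mixture_target(atoms, reward=0.0, gamma=1.0, drop_per_critic=1)
```

In this toy case the two inflated atoms (10.0 and 12.0) are removed, so the mean of the target atoms is lower than the mean of the untruncated mixture. Changing `drop_per_critic` adjusts how much of the right tail is removed, which is one way to read the abstract's claim of granular overestimation control.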
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Continuous Control | MuJoCo Ant v4 | Average Return | 3.58e+3 | 46 |
| Continuous Control | MuJoCo Walker2d v4 | -- | -- | 39 |
| Continuous Control | MuJoCo HalfCheetah v4 | Average Return | 1.23e+4 | 36 |
| Continuous Control | Walker2D v5 | Average Return | 5.80e+3 | 17 |
| Continuous Control | Hopper v5 | Average Return | 3.70e+3 | 15 |
| Continuous Control | Gym MuJoCo Hopper v4 | Average Return | 3.53e+3 | 15 |
| Continuous Control | Gym MuJoCo Suite Aggregate | IQM | 1.143 | 15 |
| Continuous Control | Gym MuJoCo Humanoid v4 | Average Return | 6.03e+3 | 15 |
| Continuous Control | MuJoCo Humanoid v5 | Maximum Average Return | 6.33e+3 | 13 |
| Continuous Control | Humanoid v5 | Average Return | 5.27e+3 | 13 |
Showing 10 of 22 rows