Max-Min Off-Policy Actor-Critic Method Focusing on Worst-Case Robustness to Model Misspecification
About
In reinforcement learning, training a policy in the real world is costly and risky, so policies are typically trained in a simulation environment and then transferred to the corresponding real-world environment. However, the simulation environment does not perfectly mimic the real-world environment, leading to model misspecification. Multiple studies report significant deterioration of policy performance in the real-world environment. In this study, we focus on scenarios involving a simulation environment with uncertainty parameters and the set of their possible values, called the uncertainty parameter set. The aim is to optimize the worst-case performance over the uncertainty parameter set in order to guarantee performance in the corresponding real-world environment. To obtain such a policy, we propose an off-policy actor-critic approach called the Max-Min Twin Delayed Deep Deterministic Policy Gradient algorithm (M2TD3), which solves a max-min optimization problem using a simultaneous gradient ascent-descent approach. Experiments in multi-joint dynamics with contact (MuJoCo) environments show that the proposed method achieves worst-case performance superior to several baseline approaches.
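To make the idea of simultaneous gradient ascent-descent on a max-min problem concrete, the following is a minimal sketch on a toy scalar objective. It is not the M2TD3 implementation: the objective `f`, the names `theta` (standing in for the policy parameters being maximized over) and `omega` (standing in for the uncertainty parameter being minimized over), the step size, and the starting point are all assumptions chosen for illustration.

```python
def f_grad(theta, omega):
    """Gradients of the hypothetical toy objective
    f(theta, omega) = -(theta - 1)^2 + (omega + 0.5)^2 + theta * omega,
    which is concave in theta and convex in omega."""
    d_theta = -2.0 * (theta - 1.0) + omega
    d_omega = 2.0 * (omega + 0.5) + theta
    return d_theta, d_omega


def simultaneous_gda(theta=0.0, omega=0.0, lr=0.1, steps=200):
    """Solve max_theta min_omega f(theta, omega) by updating both
    variables at the same time from the same pair of gradients."""
    for _ in range(steps):
        d_theta, d_omega = f_grad(theta, omega)
        # Ascent step for the maximizer, descent step for the minimizer,
        # both computed from the current (not yet updated) point.
        theta, omega = theta + lr * d_theta, omega - lr * d_omega
    return theta, omega


theta, omega = simultaneous_gda()
print(theta, omega)
```

For this particular toy objective the iterates approach the saddle point (0.6, -0.8), which is also the max-min solution: for any fixed `theta`, the minimizing `omega` is `-0.5 - theta / 2`, and the resulting worst-case value is maximized at `theta = 0.6`. On objectives that are not concave-convex, simultaneous updates of this kind are not guaranteed to converge.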
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Reinforcement Learning | MuJoCo HumanoidStandup | Average Performance | 1.20e+5 | 24 |
| Reinforcement Learning | MuJoCo Half-Cheetah | Average Return | 4.93e+3 | 18 |
| Reinforcement Learning | MuJoCo Walker | Average Return | 4.62e+3 | 14 |
| Reinforcement Learning | MuJoCo Ant | Average Return | 5.96e+3 | 14 |
| Reinforcement Learning | MuJoCo Hopper | Average Return | 1.25e+3 | 14 |
| Robot Locomotion | Ant v1 (test) | Performance Score | 2.37e+3 | 12 |
| Robot Locomotion | Humanoid v1 (test) | Total Score | 9.31e+4 | 12 |
| Continuous Control | HumanoidStandup MuJoCo (test) | Worst Case Performance | 1.16e+5 | 12 |
| Continuous Control | MuJoCo HumanoidStandup logarithmic adversary v1 | Average Performance | 1.19e+5 | 12 |
| Continuous Control | MuJoCo HumanoidStandup fixed random adversary L=0.1 | Average Performance | 1.19e+5 | 12 |