
Safe Exploration in Continuous Action Spaces

About

We address the problem of deploying a reinforcement learning (RL) agent on a physical system, such as a datacenter cooling unit or a robot, where critical constraints must never be violated. We show how to exploit the typically smooth dynamics of these systems so that RL algorithms never violate constraints during learning. Our technique is to add a safety layer directly on top of the policy that analytically solves an action-correction formulation for each state. The elegant closed-form solution is obtained thanks to a linearized constraint model, learned from past trajectories consisting of arbitrary actions. This mimics real-world circumstances in which data logs were generated by a behavior policy that is implausible to describe mathematically; such cases render the known safety-aware off-policy methods inapplicable. We demonstrate the efficacy of our approach on new, representative physics-based environments, and prevail where reward shaping fails by maintaining zero constraint violations.
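The safety layer described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: it assumes the linearized constraint model c_i(s') ≈ c_i(s) + g_i(s)ᵀa with limits c_i(s') ≤ C_i, and the simplifying assumption (made in the paper's closed-form derivation) that at most one constraint is active at a time. All names (`safety_layer`, `c`, `G`, `C`) are illustrative.

```python
import numpy as np

def safety_layer(action, c, G, C):
    """Correct `action` to the closest safe action in the L2 sense.

    Solves, in closed form, the per-state QP
        min_a ||a - action||^2   s.t.   c + G @ a <= C
    under the single-active-constraint assumption.

    action : (d,)   action proposed by the policy
    c      : (k,)   current constraint values c_i(s)
    G      : (k, d) learned constraint gradients g_i(s) w.r.t. the action
    C      : (k,)   constraint limits C_i
    """
    # Per-constraint Lagrange multipliers, clipped at zero so that
    # constraints predicted to be satisfied leave the action untouched.
    lam = np.maximum(0.0, (c + G @ action - C) / (np.sum(G * G, axis=1) + 1e-8))
    i = np.argmax(lam)  # most-violating constraint
    # Move the action along the negative constraint gradient just enough
    # to bring the predicted constraint value back to its limit.
    return action - lam[i] * G[i]
```

For example, with a single 1-D constraint c(s) = 0, g(s) = 1, and limit C = 1, a proposed action of 2.0 is predicted to violate the limit and is corrected to 1.0, while a proposed action of 0.5 passes through unchanged.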

Gal Dalal, Krishnamurthy Dvijotham, Matej Vecerik, Todd Hester, Cosmin Paduraru, Yuval Tassa • 2018

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Reinforcement Learning | Safe CartPole | Episode Reward | 53.8 | 7 |
| Reinforcement Learning | Spring Pendulum | Episode Reward | 1.1155 | 7 |
| Reinforcement Learning | OPF with Battery Energy Storage | Episode Reward | -24.1147 | 7 |
| Safe Reinforcement Learning | Safe CartPole | Training Time (s) | 1.07e+3 | 7 |
| Safe Reinforcement Learning | Spring Pendulum | Training Time (s) | 3.86e+3 | 7 |
| Safe Reinforcement Learning | OPF with Battery Energy Storage | Training Time (s) | 6.40e+3 | 7 |
