
Dynamic Model Predictive Shielding for Provably Safe Reinforcement Learning

About

Among approaches for provably safe reinforcement learning, Model Predictive Shielding (MPS) has proven effective at complex tasks in continuous, high-dimensional state spaces by leveraging a backup policy to ensure safety when the learned policy attempts to take risky actions. However, while MPS can ensure safety both during and after training, it often hinders task progress because backup policies are conservative and task-oblivious. This paper introduces Dynamic Model Predictive Shielding (DMPS), which optimizes reinforcement learning objectives while maintaining provable safety. DMPS employs a local planner to dynamically select safe recovery actions that maximize both short-term progress and long-term rewards. Crucially, the planner and the neural policy play synergistic roles in DMPS. When planning recovery actions to ensure safety, the planner uses the neural policy to estimate long-term rewards, allowing it to look beyond its short-term planning horizon. Conversely, the neural policy under training learns from the recovery plans proposed by the planner, converging to policies that are both high-performing and safe in practice. This approach guarantees safety during and after training, with bounded recovery regret that decreases exponentially with planning horizon depth. Experimental results demonstrate that DMPS converges to policies that rarely require shield interventions after training and achieve higher rewards than several state-of-the-art baselines.
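
To make the difference between the two shields concrete, here is a minimal Python sketch on a toy problem. Everything in it is an illustrative assumption rather than the authors' implementation: the 1-D point-mass dynamics, the wall at `WALL`, the three discrete accelerations, the `reward` function, the brute-force enumeration that stands in for the local planner, and the `value` callable that stands in for the neural policy's long-term reward estimate. The planner here also requires every planned state to be recoverable, which is a conservative simplification.

```python
from itertools import product

WALL = 10.0                  # toy assumption: a state is unsafe once x > WALL
ACTIONS = (-1.0, 0.0, 1.0)   # toy assumption: discretized accelerations


def step(state, a):
    """Deterministic toy dynamics: 1-D point mass with a unit time step."""
    x, v = state
    v_next = v + a
    return (x + v_next, v_next)


def is_safe(state):
    return state[0] <= WALL


def reward(state):
    """Hypothetical task reward: being closer to the wall is better."""
    return state[0]


def backup_action(state):
    """Backup policy: brake while moving toward the wall, otherwise coast."""
    return -1.0 if state[1] > 0 else 0.0


def is_recoverable(state, max_steps=50):
    """True if rolling out the backup policy from `state` never leaves the safe set."""
    for _ in range(max_steps):
        if not is_safe(state):
            return False
        if state[1] <= 0:
            return True  # no longer approaching the wall, so coasting stays safe
        state = step(state, backup_action(state))
    return False


def mps_action(state, policy):
    """MPS-style shield: keep the policy's action only if the next state is recoverable."""
    a = policy(state)
    return a if is_recoverable(step(state, a)) else backup_action(state)


def dmps_action(state, policy, value, horizon=3):
    """DMPS-style shield: when the policy's action is rejected, search short plans
    that stay recoverable and score them by rewards plus a terminal value estimate."""
    a = policy(state)
    if is_recoverable(step(state, a)):
        return a
    best_score, best_first = float("-inf"), backup_action(state)
    for plan in product(ACTIONS, repeat=horizon):
        s, total, ok = state, 0.0, True
        for u in plan:
            s = step(s, u)
            if not is_recoverable(s):
                ok = False
                break
            total += reward(s)
        if ok and total + value(s) > best_score:
            best_score, best_first = total + value(s), plan[0]
    return best_first  # falls back to the backup action if no plan qualified
```

In this toy setup, starting from `(6.0, 2.0)` with a policy that always accelerates, `mps_action` falls back to braking, whereas `dmps_action` with a zero `value` estimate chooses to coast, which keeps making progress toward the wall while remaining recoverable. In the method described above, the learned policy would supply the long-term estimate that `value` only stands in for.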

Arko Banerjee, Kia Rahmani, Joydeep Biswas, Isil Dillig • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Reinforcement Learning | DI single-gate | Mean Return | 11.6 | 10 |
| Reinforcement Learning | ST-road | Mean Performance | 22.7 | 6 |
| Reinforcement Learning | ST-road2d | Mean Score | 24 | 6 |
| Reinforcement Learning | ST-mount-car | Mean Performance | 81.2 | 6 |
| Reinforcement Learning | ST-obstacle2 | Mean Score | 20.2 | 6 |
| Reinforcement Learning | ST-obstacle | Mean Performance Score | 32.7 | 6 |
| Reinforcement Learning | DI dynamic-obs | Mean Score | 13.2 | 5 |
| Reinforcement Learning | DI-double-gates | Mean Score | 12.7 | 5 |
| Reinforcement Learning | DI-double-gates+ | Mean Reward | 13 | 5 |
| Reinforcement Learning | DD dynamic-obs | Mean Score | 7.4 | 5 |

Showing 10 of 35 rows
