Action-Conditioned Risk Gating for Safety-Critical Control under Partial Observability

About

Many safety-critical control problems are modeled as risk-sensitive partially observable Markov decision processes, where the controller must make decisions from incomplete observations while balancing task performance against safety risk. Although belief-space planning provides a principled solution, maintaining and planning over beliefs can be computationally costly and sensitive to model specification in practical domains. We propose a lightweight risk-gated reinforcement learning approximation for risk-sensitive control under partial observability. The method constructs a compact finite-history proxy state and learns an action-conditioned predictor of near-term safety violation. This predicted candidate-action risk is used in two complementary ways: as a risk penalty during value learning, and as a decision-time gate that interpolates between optimistic and conservative ensemble value estimates. As a result, low-risk actions are evaluated closer to reward-seeking estimates, while high-risk actions are evaluated more conservatively. We evaluate the approach in two safety-critical partially observable domains: automated glucose regulation and safety-constrained navigation. Across adult and adolescent glucose-control cohorts, the method improves overall glycemic tradeoffs and substantially reduces runtime relative to a belief-space planning baseline. On Safety-Gym navigation benchmarks, it achieves a more favorable reward-cost balance than unconstrained RL and several standard safe-RL baselines. These results suggest that action-conditioned near-term risk can provide an effective local signal for approximate risk-sensitive POMDP control when full belief-space planning is impractical.
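The two uses of predicted risk described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the optimistic and conservative estimates are the ensemble maximum and minimum, that the gate is a linear interpolation weighted by a risk probability in [0, 1], and that the training penalty is a weighted subtraction from the reward. The names `gated_values`, `select_action`, `penalized_reward`, and `penalty_weight` are hypothetical.

```python
def gated_values(q_ensemble, risk):
    """Blend optimistic and conservative ensemble values per action.

    q_ensemble: list of per-member value lists, one value per candidate action.
    risk: predicted near-term violation probability per candidate action.
    Assumption: optimistic = ensemble max, conservative = ensemble min.
    """
    gated = []
    for a in range(len(risk)):
        member_values = [member[a] for member in q_ensemble]
        r = min(max(risk[a], 0.0), 1.0)  # clamp risk to [0, 1]
        # Low risk leans on the optimistic value, high risk on the conservative one.
        gated.append((1.0 - r) * max(member_values) + r * min(member_values))
    return gated


def select_action(q_ensemble, risk):
    """Pick the candidate action with the highest risk-gated value."""
    values = gated_values(q_ensemble, risk)
    return max(range(len(values)), key=lambda a: values[a])


def penalized_reward(reward, risk_taken, penalty_weight=1.0):
    """Assumed form of the training-time risk penalty: reward minus weighted risk."""
    return reward - penalty_weight * risk_taken


# Two ensemble members, two candidate actions (illustrative numbers only).
q_ensemble = [[1.0, 3.0], [0.8, -1.0]]
risky_choice = select_action(q_ensemble, risk=[0.0, 0.0])
gated_choice = select_action(q_ensemble, risk=[0.1, 0.8])
```

In this toy case the second action has the highest optimistic value, so it is chosen when predicted risk is zero. Once its predicted risk is high, its value is pulled toward the pessimistic ensemble member and the first action is chosen instead.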

Yushen Liu, Yin-Jen Chen, Ziyi Chen, Tao Wang, Heng Huang, Xugui Zhou, Yanfu Zhang • 2026

Related benchmarks

Task	Dataset	Result
Safety-constrained Reinforcement Learning	Safety-Gym SafetyPointCircle1 (evaluation)	Average Reward: 10.86	15
Safety-constrained Reinforcement Learning	Safety-Gym SafetyPointGoal1 (evaluation)	Average Reward: 5.39	11
Glucose Regulation	UVa/Padova simulator Adult cohort 14 days (10 virtual patients)	TIR: 82	7
Glucose Regulation	UVa/Padova simulator Adolescent cohort 14 days (10 virtual patients)	TIR: 71.6	7

