Constrained Decision Transformer for Offline Safe Reinforcement Learning
About
Safe reinforcement learning (RL) trains a constraint-satisfying policy by interacting with the environment. We aim to tackle a more challenging problem: learning a safe policy from an offline dataset. We study the offline safe RL problem from a novel multi-objective optimization perspective and propose the $\epsilon$-reducible concept to characterize problem difficulty. The inherent trade-off between safety and task performance inspires us to propose the constrained decision transformer (CDT) approach, which can dynamically adjust this trade-off during deployment. Extensive experiments show the advantages of the proposed method in learning an adaptive, safe, robust, and high-reward policy. CDT outperforms its variants and strong offline safe RL baselines by a large margin with the same hyperparameters across all tasks, while retaining zero-shot adaptation to different constraint thresholds, making our approach more suitable for real-world RL under constraints. The code is available at https://github.com/liuzuxin/OSRL.
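The deployment-time adjustability described above comes from conditioning a return-conditioned policy on two targets at once: a desired reward return-to-go and a cost return-to-go (the constraint budget), both decremented as the episode unfolds. The sketch below illustrates only this rollout loop under assumed placeholder interfaces (`policy`, `env_step`); it is not the authors' implementation, which uses a transformer policy trained on offline trajectories.

```python
# Minimal sketch of CDT-style conditioned inference. The policy is queried
# with both a reward return-to-go and a cost return-to-go; changing the
# initial `cost_budget` at deployment adjusts the safety/performance
# trade-off without retraining. `policy` and `env_step` are hypothetical
# placeholders standing in for a trained model and an environment.

def rollout(policy, env_step, init_state, target_reward, cost_budget, horizon):
    """Roll out a return-conditioned policy for up to `horizon` steps.

    policy(state, reward_rtg, cost_rtg) -> action
    env_step(state, action) -> (next_state, reward, cost, done)
    """
    state = init_state
    reward_rtg, cost_rtg = target_reward, cost_budget
    total_reward, total_cost = 0.0, 0.0
    for _ in range(horizon):
        action = policy(state, reward_rtg, cost_rtg)
        state, reward, cost, done = env_step(state, action)
        total_reward += reward
        total_cost += cost
        # Decrement both returns-to-go: the remaining cost budget shrinks
        # as constraint violations accumulate, so the policy is pushed
        # toward safer actions late in the episode.
        reward_rtg -= reward
        cost_rtg = max(cost_rtg - cost, 0.0)
        if done:
            break
    return total_reward, total_cost
```

Keeping the cost return-to-go clipped at zero reflects that a constraint budget, once exhausted, cannot become negative; the policy then sees a zero remaining budget for the rest of the episode.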
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Auto-bidding | AuctionNet | Score | 437.7 | 90 |
| Auto-bidding | AuctionNet-Sparse | Score | 44.56 | 45 |
| Safe Reinforcement Learning | Bullet Safety Gym | Normalized Reward | 0.61 | 10 |
| Safe Reinforcement Learning | MetaDrive | Normalized Reward | 0.4 | 10 |
| DroneRun | Bullet-Safety-Gym OSRL | Reward | 0.84 | 9 |
| CarRun | Bullet-Safety-Gym OSRL | Reward | 0.96 | 9 |
| CarCircle | Bullet-Safety-Gym OSRL | Reward | 0.71 | 9 |
| BallCircle | Bullet-Safety-Gym OSRL | Reward | 0.73 | 9 |
| BallRun | Bullet-Safety-Gym OSRL | Reward | 0.35 | 9 |
| Constrained Bidding | AuctionNet | Value | 357.4 | 9 |