PREFINE: Preference-Based Implicit Reward and Cost Fine-Tuning for Safety Alignment
About
We address the problem of making a pre-trained reinforcement learning (RL) policy safety-aware by incorporating cost constraints without retraining it from scratch. While costs could be encoded numerically, we consider the more general setting in which costs are provided as preferences. Given a reward-optimized policy and a small dataset of preferred (low-cost) and dispreferred (high-cost) trajectories, our goal is to fine-tune the policy to generate low-cost behaviors while retaining high rewards. Unlike standard RLHF in language models, where preferences are defined over responses to the same prompt, our setting involves trajectory-level preferences in continuous control environments. We introduce PREFINE (Preference-based Implicit Reward and Cost Fine-Tuning for Safety Alignment), a preference-based fine-tuning method that adapts Direct Preference Optimization (DPO), now widely used for LLM fine-tuning, to the sequential decision-making setting. PREFINE constructs policy-sampled counterfactual trajectories to establish meaningful preference contrasts and jointly optimizes for reward retention and safety alignment. Empirically, PREFINE reduces constraint violations and catastrophic failures by over 60% while maintaining the original reward behavior. Compared to full offline RL or imitation learning, PREFINE produces low-cost, high-reward policies with significantly improved data and computational efficiency, bridging preference alignment and safe policy adaptation in continuous domains.
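To make the idea of adapting DPO to trajectory-level preferences concrete, the following is a minimal sketch of the standard DPO loss applied to one pair of trajectories. It is an illustration under stated assumptions, not the PREFINE objective itself: it assumes a trajectory's log-probability is the sum of per-step action log-probabilities under the policy, and it omits the reward-retention term and the construction of policy-sampled counterfactual trajectories described above. The function names and the `beta` default are hypothetical.

```python
import numpy as np

def trajectory_log_prob(step_log_probs):
    """Sum per-step action log-probabilities log pi(a_t | s_t) over one trajectory.

    Assumption: the trajectory likelihood factorizes over time steps.
    """
    return float(np.sum(step_log_probs))

def dpo_trajectory_loss(policy_pref, policy_dispref, ref_pref, ref_dispref, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) trajectory pair.

    policy_*: trajectory log-probabilities under the policy being fine-tuned.
    ref_*:    trajectory log-probabilities under the frozen pre-trained policy.
    beta:     temperature on the implicit preference margin (hypothetical default).
    """
    # Implicit margin: how much more the fine-tuned policy favors the
    # preferred (low-cost) trajectory than the reference policy does.
    margin = beta * ((policy_pref - ref_pref) - (policy_dispref - ref_dispref))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    return float(np.logaddexp(0.0, -margin))

# Usage: illustrative per-step log-probabilities for a low-cost and a high-cost trajectory.
low_cost = trajectory_log_prob([-0.5, -0.7, -0.8])
high_cost = trajectory_log_prob([-1.5, -2.0, -2.5])
loss = dpo_trajectory_loss(low_cost, high_cost, ref_pref=-3.0, ref_dispref=-4.0)
```

When the fine-tuned policy matches the reference, the margin is zero and the loss equals log 2; the loss falls below that value as the policy shifts probability toward the preferred trajectory relative to the reference.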
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| CarRun | Bullet Safety Gym | Normalized Reward | 93 | 8 |
| BallCircle | Bullet Safety Gym | Normalized Reward | 71 | 8 |
| CarCircle | Bullet Safety Gym | Normalized Reward | 0.54 | 8 |
| AntVelocity | Safety Gym | Normalized Reward | 0.93 | 6 |
| BallRun | Bullet Gym | Normalized Reward | 0.41 | 6 |
| CarGoal | Safety Gym | Normalized Reward | 0.38 | 6 |
| DroneRun | Bullet Gym | Normalized Reward | 0.57 | 6 |
| HalfCheetahVelocity | Safety Gym | Normalized Reward | 86 | 6 |
| HopperVelocity | Safety Gym | Normalized Reward | 66 | 6 |
| SwimmerVelocity | Safety Gym | Normalized Reward | 0.62 | 6 |