PREFINE: Preference-Based Implicit Reward and Cost Fine-Tuning for Safety Alignment

About

We address the problem of making a pre-trained reinforcement learning (RL) policy safety-aware by incorporating cost constraints without retraining it from scratch. Although costs can be encoded numerically, we consider the more general setting in which costs are provided as preferences. Given a reward-optimized policy and a small dataset of preferred (low-cost) and dispreferred (high-cost) trajectories, our goal is to fine-tune the policy to generate low-cost behaviors while retaining high rewards. Unlike standard RLHF in language models, where preferences are defined over responses to the same prompt, our setting involves trajectory-level preferences in continuous control environments. We introduce PREFINE (Preference-based Implicit Reward and Cost Fine-Tuning for Safety Alignment), a preference-based fine-tuning method that adapts Direct Preference Optimization (DPO), now widely used for LLM fine-tuning, to the sequential decision-making setting. PREFINE constructs policy-sampled counterfactual trajectories to establish meaningful preference contrasts and jointly optimizes for reward retention and safety alignment. Empirically, PREFINE reduces constraint violations and catastrophic failures by over 60% while maintaining the original reward behavior. It produces low-cost, high-reward policies with significantly better data and computational efficiency than full offline RL or imitation learning, bridging preference alignment and safe policy adaptation in continuous domains.
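To make the idea of trajectory-level DPO concrete, here is a minimal sketch of the standard DPO loss with a trajectory's log-probability taken as the sum of its per-step action log-probabilities. This is an illustration under stated assumptions, not the paper's actual objective. The function names, the `beta` value, and the per-step log-probability inputs are all hypothetical, and the sketch covers only the preference term. It omits the reward-retention term and the policy-sampled counterfactual trajectories described above.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def trajectory_logprob(step_logprobs):
    """Sum of per-step log pi(a_t | s_t) over one trajectory."""
    return sum(step_logprobs)


def dpo_trajectory_loss(pol_pref, pol_dispref, ref_pref, ref_dispref, beta=0.1):
    """DPO-style loss for one (preferred, dispreferred) trajectory pair.

    Each argument is a list of per-step action log-probabilities:
    pol_* under the policy being fine-tuned, ref_* under the frozen
    pre-trained (reference) policy. beta is an assumed temperature.
    """
    # Implicit reward of a trajectory: beta * (log pi - log pi_ref).
    r_pref = beta * (trajectory_logprob(pol_pref) - trajectory_logprob(ref_pref))
    r_dispref = beta * (trajectory_logprob(pol_dispref) - trajectory_logprob(ref_dispref))
    # -log sigmoid(r_pref - r_dispref), written as softplus of the negated margin.
    return softplus(-(r_pref - r_dispref))


# Hypothetical two-step trajectories (values chosen only for illustration).
ref_steps = [-1.0, -1.0]
loss_at_init = dpo_trajectory_loss(ref_steps, ref_steps, ref_steps, ref_steps)
loss_after_shift = dpo_trajectory_loss([-0.5, -0.5], ref_steps, ref_steps, ref_steps)
```

When the fine-tuned policy equals the reference policy, the margin is zero and the loss is log 2. Raising the likelihood of the preferred (low-cost) trajectory relative to the reference lowers the loss.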

Richa Verma, Bavish Kulur, Sanjay Chawla, Balaraman Ravindran • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| CarRun | Bullet Safety Gym | Normalized Reward: 93 | 8 |
| BallCircle | Bullet Safety Gym | Normalized Reward: 71 | 8 |
| CarCircle | Bullet Safety Gym | Normalized Reward: 0.54 | 8 |
| AntVelocity | Safety Gym | Normalized Reward: 0.93 | 6 |
| BallRun | Bullet Gym | Normalized Reward: 0.41 | 6 |
| CarGoal | Safety Gym | Normalized Reward: 0.38 | 6 |
| DroneRun | Bullet Gym | Normalized Reward: 0.57 | 6 |
| HalfCheetahVelocity | Safety Gym | Normalized Reward: 86 | 6 |
| HopperVelocity | Safety Gym | Normalized Reward: 66 | 6 |
| SwimmerVelocity | Safety Gym | Normalized Reward: 0.62 | 6 |
Showing 10 of 12 rows
