Implicit Safety Alignment from Crowd Preferences

About

Reinforcement Learning from Human Feedback (RLHF) can reveal implicit objectives, such as safety considerations, that go beyond task completion. In this work, we focus on the common safety criteria embedded in crowd preference datasets, where different users may express distinct preferences or objectives, yet follow similar safety principles. Our aim is to discover shared safety criteria from crowd preferences and then transfer them to downstream RL tasks to regularize agent behavior and enforce safety. We first show that direct reward combination (optimizing a preference-learned reward model together with downstream task rewards) has inherent limitations. Motivated by this, we propose Safe Crowd Preference-based RL, a hierarchical framework that extracts safety-aligned skills from crowd preferences and composes them via a high-level policy to safely solve downstream tasks. Experiments across safe RL environments and a preliminary LLM-style task with diverse user goals and shared safety constraints demonstrate that our approach substantially lowers safety costs without access to explicit safety rewards, while achieving task performance comparable to oracle methods trained with ground-truth safety signals.
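
To make the "direct reward combination" baseline mentioned above concrete, here is a minimal sketch of what such a combination could look like. It is an illustration under stated assumptions, not the authors' implementation: the additive form, the `weight` parameter, and both toy reward functions are hypothetical and do not come from the paper.

```python
from typing import Callable, Sequence

State = Sequence[float]
Action = Sequence[float]
RewardFn = Callable[[State, Action], float]


def combine_rewards(task_reward: RewardFn,
                    preference_reward: RewardFn,
                    weight: float = 1.0) -> RewardFn:
    """Return r(s, a) = r_task(s, a) + weight * r_pref(s, a).

    The additive form and the `weight` knob are assumptions made for
    illustration only.
    """
    def combined(state: State, action: Action) -> float:
        return task_reward(state, action) + weight * preference_reward(state, action)

    return combined


def toy_task_reward(state: State, action: Action) -> float:
    # Hypothetical downstream task: reward grows with speed.
    return abs(action[0])


def toy_preference_reward(state: State, action: Action) -> float:
    # Hypothetical stand-in for a preference-learned reward model:
    # penalizes any speed above 1.0.
    return -max(0.0, abs(action[0]) - 1.0)


combined = combine_rewards(toy_task_reward, toy_preference_reward, weight=2.0)
slow = combined([0.0], [0.5])  # below the speed threshold, no penalty
fast = combined([0.0], [3.0])  # above the threshold, penalty outweighs task reward
print(slow, fast)
```

In this toy setup the trade-off between task reward and the learned safety signal depends entirely on the hand-picked `weight`. The abstract states that this kind of combination has inherent limitations, which motivates the hierarchical framework.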

Qian Lin, Daniel S. Brown • 2026

Related benchmarks

Task                         Dataset                                    Result                  Rank
Reinforcement Learning       Reach online downstream setting            Normalized Reward 0.91  6
Reinforcement Learning       Ant-vel online downstream setting          Normalized Reward 0.94  6
Reinforcement Learning       HalfCheetah-vel online downstream setting  Normalized Reward 0.96  6
Safe Reinforcement Learning  HalfCheetah vel (offline)                  Normalized Reward 0.96  6
Reinforcement Learning       Swimmer-vel online downstream setting      Normalized Reward 1     6
Safe Reinforcement Learning  Reach (offline)                            Normalized Reward 0.98  6
Safe Reinforcement Learning  Ant-vel (offline)                          Normalized Reward 0.9   6
Safe Reinforcement Learning  Swimmer-vel (offline)                      Normalized Reward 0.99  6
Reinforcement Learning       Run online downstream setting              Normalized Reward 100   6
Reinforcement Learning       Circle online downstream setting           Normalized Reward 0.97  6
Showing 10 of 13 rows
