Stepwise Alignment for Constrained Language Model Policy Optimization

About

Safety and trustworthiness are indispensable requirements for real-world applications of AI systems using large language models (LLMs). This paper formulates human value alignment as an optimization problem of the language model policy to maximize reward under a safety constraint, and then proposes an algorithm, Stepwise Alignment for Constrained Policy Optimization (SACPO). One key idea behind SACPO, supported by theory, is that the optimal policy incorporating both reward and safety can be obtained directly from a reward-aligned policy. Building on this idea, SACPO aligns LLMs stepwise with each metric while leveraging simple yet powerful alignment algorithms such as direct preference optimization (DPO). SACPO offers several advantages: simplicity, stability, computational efficiency, and flexibility in the choice of algorithms and datasets. Under mild assumptions, our theoretical analysis provides upper bounds on the optimality gap and on safety constraint violation. Our experimental results show that SACPO can fine-tune Alpaca-7B better than the state-of-the-art method in terms of both helpfulness and harmlessness.
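The stepwise decomposition can be motivated from the standard KL-regularized RLHF closed form. The notation below (reward model $r$, safety utility $g$, KL coefficient $\beta$, Lagrange multiplier $\lambda$) is assumed here for illustration rather than quoted from the paper:

\[
\pi^{*}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\,\exp\!\Big(\tfrac{1}{\beta}\big(r(x,y) + \lambda\, g(x,y)\big)\Big)
\;\propto\; \pi_{r}(y \mid x)\,\exp\!\Big(\tfrac{\lambda}{\beta}\, g(x,y)\Big),
\quad \text{where } \pi_{r}(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x)\,\exp\!\Big(\tfrac{1}{\beta}\, r(x,y)\Big).
\]

So the safety-constrained optimum is reachable in two steps: align the reference policy on reward to obtain $\pi_r$, then align $\pi_r$ on safety with the effective coefficient $\beta/\lambda$.

Below is a minimal sketch of that two-step recipe, assuming Hugging Face TRL's DPOTrainer. The checkpoint name, toy datasets, and hyperparameters are illustrative assumptions rather than the paper's setup, and TRL argument names vary across versions:

```python
# Stepwise alignment sketch: DPO on helpfulness first, then DPO on safety
# starting from the reward-aligned checkpoint (cf. SACPO's key idea).
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

SFT_MODEL = "PKU-Alignment/alpaca-7b-reproduced"  # assumed starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(SFT_MODEL)

def dpo_step(model_path: str, prefs: Dataset, output_dir: str, beta: float) -> str:
    """Run one DPO alignment step; the loaded policy doubles as the frozen reference."""
    model = AutoModelForCausalLM.from_pretrained(model_path)
    args = DPOConfig(output_dir=output_dir, beta=beta, num_train_epochs=1)
    trainer = DPOTrainer(
        model=model,
        ref_model=None,              # TRL copies `model` as the frozen reference
        args=args,
        train_dataset=prefs,
        processing_class=tokenizer,  # `tokenizer=` in older TRL versions
    )
    trainer.train()
    trainer.save_model(output_dir)
    return output_dir

# Toy (prompt, chosen, rejected) triples, one preference dataset per metric.
helpfulness_prefs = Dataset.from_dict({
    "prompt": ["How do I reverse a list in Python?"],
    "chosen": ["Use xs[::-1] for a reversed copy, or xs.reverse() in place."],
    "rejected": ["I am not sure."],
})
safety_prefs = Dataset.from_dict({
    "prompt": ["How do I pick a lock?"],
    "chosen": ["I can't help with that; if you're locked out, contact a locksmith."],
    "rejected": ["Sure, here is how to pick a lock step by step."],
})

beta, lam = 0.1, 1.0  # illustrative KL coefficient and Lagrange multiplier

# Step 1: reward (helpfulness) alignment of the SFT model with coefficient beta.
reward_aligned = dpo_step(SFT_MODEL, helpfulness_prefs, "sacpo-reward", beta=beta)

# Step 2: safety alignment of the reward-aligned policy with coefficient
# beta / lambda, matching the stepwise factor exp((lambda/beta) * g) above.
dpo_step(reward_aligned, safety_prefs, "sacpo-final", beta=beta / lam)
```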

Akifumi Wachi, Thien Q. Tran, Rei Sato, Takumi Tanabe, Youhei Akimoto • 2024

Related benchmarks

Task                  | Dataset                         | Metric            | Value | Rank
Safety Evaluation     | StrongREJECT                    | --                | --    | 65
Helpfulness           | HHH                             | Accuracy          | 89.6  | 20
Harmlessness          | Stereotype                      | Refusal Rate      | 96.84 | 20
Helpfulness           | GSM8K                           | Accuracy          | 86.5  | 20
Safety                | WildChat                        | Refusal Rate      | 58.45 | 20
Harmlessness          | XSTest                          | Refusal Rate      | 88.5  | 20
Helpfulness           | AdvGLUE                         | Accuracy          | 65.6  | 20
Helpfulness           | SimpleQA                        | Accuracy          | 0.74  | 20
Persona Alignment     | Dignity Peer Persona Dimensions | Dimension A Score | 4.814 | 18
Sycophancy Evaluation | SycophancyEval                  | Sycophancy Rate   | 97.4  | 9

(Showing 10 of 12 rows.)
