Gradient-Adaptive Policy Optimization: Towards Multi-Objective Alignment of Large Language Models

About

Reinforcement Learning from Human Feedback (RLHF) has emerged as a powerful technique for aligning large language models (LLMs) with human preferences. However, effectively aligning LLMs with diverse human preferences remains a significant challenge, particularly when those preferences conflict. To address this issue, we frame human value alignment as a multi-objective optimization problem, aiming to maximize a set of potentially conflicting objectives. We introduce Gradient-Adaptive Policy Optimization (GAPO), a novel fine-tuning paradigm that employs multiple-gradient descent to align LLMs with diverse preference distributions. GAPO adaptively rescales the gradients for each objective to determine an update direction that optimally balances the trade-offs between objectives. Additionally, we introduce P-GAPO, which incorporates user preferences across different objectives and achieves Pareto solutions that better align with the user's specific needs. Our theoretical analysis demonstrates that GAPO converges towards a Pareto-optimal solution for multiple objectives. Empirical results on Mistral-7B show that GAPO outperforms current state-of-the-art methods, achieving superior performance in both helpfulness and harmlessness.
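To make the idea of a multiple-gradient update direction concrete, here is a minimal sketch of the classic two-objective min-norm step used in multiple-gradient descent. It is an illustration under stated assumptions, not GAPO's actual rescaling rule, which the abstract does not specify. The names `dot` and `min_norm_direction` are hypothetical, and `g1` and `g2` stand in for the gradients of two objectives (for example helpfulness and harmlessness) with respect to the policy parameters.

```python
def dot(u, v):
    """Inner product of two equal-length vectors given as lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def min_norm_direction(g1, g2):
    """Two-objective min-norm step (generic MGDA sketch, not GAPO itself).

    Finds alpha in [0, 1] minimizing ||alpha * g1 + (1 - alpha) * g2||^2
    and returns (alpha, d) with d = alpha * g1 + (1 - alpha) * g2.
    When d is nonzero, it has a non-negative inner product with both
    gradients, so a small step along d does not decrease either objective
    to first order.
    """
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:
        # Identical gradients: any convex combination gives the same direction.
        alpha = 0.5
    else:
        # Closed-form minimizer of the quadratic in alpha, clipped to [0, 1].
        alpha = dot([b - a for a, b in zip(g1, g2)], g2) / denom
        alpha = min(1.0, max(0.0, alpha))
    d = [alpha * a + (1.0 - alpha) * b for a, b in zip(g1, g2)]
    return alpha, d
```

For two partially conflicting gradients such as `[1.0, 0.0]` and `[-0.5, 1.0]`, the returned direction has a positive inner product with both, which is the property a Pareto-seeking update relies on. A preference-weighted variant in the spirit of P-GAPO would need an additional weighting scheme that is not described here.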

Chengao Li, Hanyu Zhang, Yunkun Xu, Hongyan Xue, Xiang Ao, Qing He • 2025

Related benchmarks

Task	Dataset	Result
Preference Alignment	Anthropic HH-RLHF (test)	LLM-as-a-Judge Helpful Score: 5.3	12
Response Preference Evaluation	UltraFeedback (test)	Win Rate: 55.6	9
Response Evaluation	UltraFeedback (test)	Win Rate: 18.5	6
Instruction Alignment	UltraFeedback	Instruction Following Win (%): 22.7	6
LLM Alignment	UltraFeedback pool (test)	Instruction Following Win Rate: 50.9	6
