Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment

About

We consider the problem of multi-objective alignment of foundation models with human preferences, which is a critical step towards helpful and harmless AI systems. However, it is generally costly and unstable to fine-tune large foundation models using reinforcement learning (RL), and the multi-dimensionality, heterogeneity, and conflicting nature of human preferences further complicate the alignment process. In this paper, we introduce Rewards-in-Context (RiC), which conditions the response of a foundation model on multiple rewards in its prompt context and applies supervised fine-tuning for alignment. The salient features of RiC are simplicity and adaptivity, as it only requires supervised fine-tuning of a single foundation model and supports dynamic adjustment for user preferences at inference time. Inspired by the analytical solution of an abstracted convex optimization problem, our dynamic inference-time adjustment method approaches the Pareto-optimal solution for multiple objectives. Empirical evidence demonstrates the efficacy of our method in aligning both Large Language Models (LLMs) and diffusion models to accommodate diverse rewards with only around 10% of the GPU hours required by the multi-objective RL baseline.
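
The idea of conditioning a response on multiple rewards in the prompt context can be illustrated with a minimal sketch. The tag format (`<name> value`), the function names, and the example reward names below are hypothetical choices made for illustration; they are not taken from the paper and should not be read as the authors' implementation.

```python
from typing import Dict


def build_reward_conditioned_prompt(
    prompt: str, rewards: Dict[str, float], precision: int = 1
) -> str:
    """Append one '<name> value' tag per reward to the prompt.

    The tag layout is an assumption for illustration only.
    """
    tags = " ".join(
        f"<{name}> {value:.{precision}f}" for name, value in rewards.items()
    )
    return f"{prompt} {tags}"


def build_sft_example(
    prompt: str, response: str, rewards: Dict[str, float]
) -> Dict[str, str]:
    """Pair a reward-conditioned input with its response for supervised fine-tuning."""
    return {
        "input": build_reward_conditioned_prompt(prompt, rewards),
        "target": response,
    }


# Hypothetical training sample: rewards would come from scoring the response
# with one reward model per objective.
example = build_sft_example(
    "Summarize the post.",
    "A short summary.",
    {"helpfulness": 0.8, "harmlessness": -0.3},
)
print(example["input"])
```

Under these assumptions, changing the reward values placed in the prompt at inference time is what lets a single fine-tuned model be steered toward different trade-offs between objectives.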

Rui Yang, Xiaoman Pan, Feng Luo, Shuang Qiu, Han Zhong, Dong Yu, Jianshu Chen • 2024

Related benchmarks

Task	Dataset	Metric	Value	
Reddit Summary Alignment	Reddit Summary normalized rewards (test)	Faithfulness Reward	0.48	60
Helpful Assistant Alignment	Helpful Assistant normalized rewards (test)	Helpfulness Reward (r1)	45	60
Assistant Response Alignment (Helpfulness and Harmlessness)	HH-RLHF (test)	Helpfulness Win Rate	76	31
Helpfulness	Alpaca Eval	Alpaca Eval (%)	13.15	22
Alignment	UltraFeedback (test)	Honesty Score	36.5	20
Preference Alignment	Psoups (test)	Helpfulness (RM)	0.9	13
Molecular Property Optimization	3d complex	Normalized Reward	97	10
Molecular Property Optimization	antibacterial like	Normalized Reward	90	10
Molecular Property Optimization	drug like	Normalized Reward	0.71	10
Molecular Property Optimization	fragment like	Normalized Reward	71	10

Showing 10 of 54 rows
