Group Robust Preference Optimization in Reward-free RLHF

About

Adapting large language models (LLMs) for specific tasks usually involves fine-tuning through reinforcement learning with human feedback (RLHF) on preference data. While these data often come from diverse labelers' groups (e.g., different demographics, ethnicities, company teams, etc.), traditional RLHF approaches adopt a "one-size-fits-all" approach, i.e., they indiscriminately assume and optimize a single preference model, thus not being robust to unique characteristics and needs of the various groups. To address this limitation, we propose a novel Group Robust Preference Optimization (GRPO) method to align LLMs to individual groups' preferences robustly. Our approach builds upon reward-free direct preference optimization methods, but unlike previous approaches, it seeks a robust policy which maximizes the worst-case group performance. To achieve this, GRPO adaptively and sequentially weights the importance of different groups, prioritizing groups with worse cumulative loss. We theoretically study the feasibility of GRPO and analyze its convergence for the log-linear policy class. By fine-tuning LLMs with GRPO using diverse group-based global opinion data, we significantly improved performance for the worst-performing groups, reduced loss imbalances across groups, and improved probability accuracies compared to non-robust baselines.
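The adaptive group weighting described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes a DPO-style per-example loss and an exponentiated-gradient update on the group weights; the function names, the step size `eta`, and the temperature `beta` are hypothetical choices made for illustration only.

```python
import numpy as np


def dpo_loss(logratio_chosen, logratio_rejected, beta=0.1):
    """DPO-style loss per example: -log sigmoid(beta * (chosen - rejected)).

    Inputs are log-probability ratios (policy vs. reference) for the
    preferred and rejected responses. `beta` is an assumed temperature.
    """
    margin = beta * (np.asarray(logratio_chosen) - np.asarray(logratio_rejected))
    # log(1 + exp(-margin)) computed stably; equals -log(sigmoid(margin)).
    return np.logaddexp(0.0, -margin)


def update_group_weights(weights, group_losses, eta=1.0):
    """Assumed exponentiated-gradient step on the probability simplex.

    Groups with larger loss receive proportionally more weight, so the
    next policy update prioritizes the worse-off groups.
    """
    w = np.asarray(weights, dtype=float) * np.exp(eta * np.asarray(group_losses))
    return w / w.sum()


def weighted_objective(weights, group_losses):
    """Weighted sum of group losses that the policy step would minimize."""
    return float(np.dot(weights, group_losses))


# Three hypothetical groups, starting from uniform weights.
weights = np.full(3, 1.0 / 3.0)
group_losses = np.array([0.2, 0.9, 0.4])  # illustrative mean losses per group
new_weights = update_group_weights(weights, group_losses)
```

Repeating the weight update across training steps concentrates mass on groups whose loss stays high, which is one way to approximate a worst-case-group objective. In a full training loop the policy parameters would be updated against `weighted_objective` between weight updates.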

Shyam Sundhar Ramesh, Yifan Hu, Iason Chaimalas, Viraj Mehta, Pier Giuseppe Sessa, Haitham Bou Ammar, Ilija Bogunovic • 2024

Related benchmarks

Task	Dataset	Metric	Result
Mathematical Reasoning	MathQA	Accuracy	44.1	354
Math Reasoning	GSM8K	Accuracy	72.7	254
Bias Evaluation	BBQ	Accuracy	97.8	171
Mathematical Reasoning	GSM-PLUS	Accuracy	52.6	90
General Utility Evaluation	MT_Bench	Agreement Rate	67	33
Stereotypical Bias Mitigation	UNQOVER	Accuracy	99.9	14
Structural Bias Evaluation	MNLI	Accuracy	92.5	14
Structural Bias Evaluation	HANS	Accuracy	99	14
Out-of-Domain (OOD) Bias Evaluation	StereoSet	Accuracy	60.4	14
General Utility Evaluation	Chatbot	Agree Score	52.7	14

Showing 10 of 17 rows
