Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning

About

Reinforcement Learning with Human Feedback (RLHF) has achieved great success in aligning large language models (LLMs) with human preferences. Prevalent RLHF approaches are reward-based, following the Bradley-Terry (BT) model assumption, which may not fully capture the complexity of human preferences. In this paper, we explore RLHF under a general preference framework and approach it from a game-theoretic perspective. Specifically, we formulate the problem as a two-player game and propose a novel online algorithm, iterative Nash policy optimization (INPO). The key idea is to let the policy play against itself via no-regret learning, thereby approximating the Nash policy. Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses, which typically incurs high computational or annotation costs. Instead, we introduce a new loss objective that is directly minimized over a preference dataset. We provide theoretical analysis for our approach and demonstrate its effectiveness through experiments on various representative benchmarks. With an LLaMA-3-8B-based SFT model, INPO achieves a 42.6% length-controlled win rate on AlpacaEval 2.0 and a 37.8% win rate on Arena-Hard, showing substantial improvement over the state-of-the-art online RLHF algorithms.
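The self-play idea in the abstract can be illustrated on a toy problem. The sketch below is not the INPO loss. It assumes a small, fully known preference matrix over three candidate responses and runs a standard no-regret learner (multiplicative weights) against itself. Unlike INPO as described above, it computes expected win rates exactly from that matrix, which is only feasible in a tabular setting. The names `PREFS`, `win_rates`, `self_play_mwu`, and `exploitability` are hypothetical and chosen for illustration.

```python
import math

def win_rates(P, policy):
    """Expected win rate of each response against an opponent sampled from `policy`."""
    return [sum(p_ij * q for p_ij, q in zip(row, policy)) for row in P]

def self_play_mwu(P, init, eta=0.1, steps=5000):
    """Multiplicative-weights self-play; returns the time-averaged policy.

    P[i][j] is the probability that response i is preferred over response j,
    with P[i][j] + P[j][i] == 1. This is a toy stand-in, not the INPO update.
    """
    policy = list(init)
    avg = [0.0] * len(policy)
    for _ in range(steps):
        # Accumulate the current iterate into the running average.
        avg = [a + p / steps for a, p in zip(avg, policy)]
        # The policy plays against itself: gains are win rates versus its own mixture.
        gains = win_rates(P, policy)
        weights = [p * math.exp(eta * g) for p, g in zip(policy, gains)]
        total = sum(weights)
        policy = [w / total for w in weights]
    return avg

def exploitability(P, policy):
    """How far the best single response beats `policy` above the 0.5 game value."""
    return max(win_rates(P, policy)) - 0.5

# Hypothetical cyclic (non-transitive) preferences: 0 beats 1, 1 beats 2, 2 beats 0.
# No single scalar reward per response can reproduce such a cycle.
PREFS = [
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
]

nash_estimate = self_play_mwu(PREFS, init=[0.6, 0.3, 0.1])
print("approximate Nash policy:", [round(p, 3) for p in nash_estimate])
print("exploitability:", round(exploitability(PREFS, nash_estimate), 4))
```

The averaged policy is returned rather than the final iterate because the standard no-regret guarantee for multiplicative weights applies to the average of the iterates. With the step size and horizon assumed here, that guarantee bounds the exploitability of the average below 0.02.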

Yuheng Zhang, Dian Yu, Baolin Peng, Linfeng Song, Ye Tian, Mingyue Huo, Nan Jiang, Haitao Mi, Dong Yu • 2024

Related benchmarks

Task	Dataset	Result
Instruction Following	IFEval	IFEval Accuracy: 73.2	836
Instruction Following	AlpacaEval 2.0	--	722
Instruction Following	MT-Bench	MT-Bench Score: 6.95	287
Instruction Following	Arena Hard	Win Rate: 48.03	263
Knowledge	MMLU	Accuracy: 74.79	161
Commonsense Reasoning	HellaSwag	HellaSwag Score: 80.22	62
Commonsense Reasoning	ARC	Accuracy: 91.07	61
Commonsense Reasoning	TruthfulQA	Accuracy: 71.24	28
Commonsense Reasoning	WinoGrande	WinoGrande Score: 73.48	22
Mathematical Reasoning	Minerva Math	Last Score: 46.32	17

Showing 10 of 16 rows
