Self-Rewarding Language Models

About

We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may then be bottlenecked by human performance level, and secondly these separate frozen reward models cannot then learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training that not only does instruction following ability improve, but also the ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While there is much left still to explore, this work opens the door to the possibility of models that can continually improve in both axes.

Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, Jason Weston• 2024

Related benchmarks

Task	Dataset	Result
Commonsense Reasoning	HellaSwag	Accuracy93.26	1896
Visual Question Answering	VizWiz	Accuracy56.1	1820
Visual Question Answering	GQA	--	1425
Mathematical Reasoning	GSM8K (test)	Accuracy76.04	954
Multimodal Understanding	MMBench	--	847
Language Understanding	MMLU	Accuracy33	844
Instruction Following	IFEval	--	836
Commonsense Reasoning	PIQA	Accuracy47.41	757
Reasoning	BBH	Accuracy31.2	726
Instruction Following	AlpacaEval 2.0	Win Rate20.44	722

Showing 10 of 68 rows

Other info

Follow for update

@wizwand_team Discord