Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs
About
In this report, we introduce a collection of methods to enhance reward modeling for LLMs, focusing specifically on data-centric techniques. We propose effective data selection and filtering strategies for curating high-quality open-source preference datasets, culminating in the Skywork-Reward data collection, which contains only 80K preference pairs -- significantly smaller than existing datasets. Using this curated dataset, we develop the Skywork-Reward model series -- Skywork-Reward-Gemma-27B and Skywork-Reward-Llama-3.1-8B -- with the former currently holding the top position on the RewardBench leaderboard. Notably, our techniques and datasets have directly enhanced the performance of many top-ranked models on RewardBench, highlighting the practical impact of our contributions in real-world preference learning applications.
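Reward models of this kind are commonly trained with the pairwise Bradley-Terry objective on curated (chosen, rejected) preference pairs. The snippet below is a minimal illustrative sketch of that objective under that assumption; it is not the authors' training code, and the `bradley_terry_loss` helper and the toy reward values are hypothetical.

```python
# Minimal sketch of the pairwise Bradley-Terry objective commonly used for
# reward modeling on preference data (illustrative only, not the authors' code).
import torch
import torch.nn.functional as F


def bradley_terry_loss(chosen_rewards: torch.Tensor,
                       rejected_rewards: torch.Tensor) -> torch.Tensor:
    # Negative log-likelihood that the chosen response outranks the rejected one:
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


if __name__ == "__main__":
    # Hypothetical scalar rewards for a batch of four preference pairs.
    chosen = torch.tensor([1.2, 0.3, 2.0, -0.5])
    rejected = torch.tensor([0.4, 0.9, 1.5, -1.0])
    print(bradley_terry_loss(chosen, rejected))
```

In practice, the scalar rewards typically come from a sequence-classification head on top of the base LLM, and the loss decreases as the model learns to score the chosen response above the rejected one for each curated pair.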
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Code Generation | HumanEval | -- | -- | 850 |
| Language Understanding | MMLU | Accuracy | 41 | 756 |
| Reasoning | BBH | Accuracy | 30.2 | 507 |
| Question Answering | ARC Easy | Normalized Accuracy | 85.2 | 385 |
| Mathematical Reasoning | GSM8K | Accuracy | 84.61 | 358 |
| Instruction Following | IFEval | -- | -- | 292 |
| Multitask Language Understanding | MMLU | Accuracy | 72.33 | 206 |
| Common Sense Reasoning | HellaSwag | Accuracy | 76.42 | 164 |
| Reading Comprehension | RACE | Accuracy | 78.82 | 151 |
| Instruction Following | AlpacaEval | Win Rate | 77.4 | 125 |