
Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs

About

In this report, we introduce a collection of methods to enhance reward modeling for LLMs, focusing specifically on data-centric techniques. We propose effective data selection and filtering strategies for curating high-quality open-source preference datasets, culminating in the Skywork-Reward data collection, which contains only 80K preference pairs -- significantly smaller than existing datasets. Using this curated dataset, we developed the Skywork-Reward model series -- Skywork-Reward-Gemma-27B and Skywork-Reward-Llama-3.1-8B -- with the former currently holding the top position on the RewardBench leaderboard. Notably, our techniques and datasets have directly enhanced the performance of many top-ranked models on RewardBench, highlighting the practical impact of our contributions in real-world preference learning applications.
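As background for the abstract above: reward models for LLMs are typically trained on preference pairs (a chosen and a rejected response to the same prompt) with a Bradley-Terry-style pairwise loss, and data-centric curation amounts to selecting and filtering those pairs. Below is a minimal sketch of both ideas; the `filter_pairs` heuristics (dropping duplicates and ties) are illustrative assumptions, not the paper's actual selection pipeline.

```python
import math

def bt_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry pairwise loss commonly used for reward model training:
    -log(sigmoid(r_chosen - r_rejected)). Smaller when the model scores the
    chosen response above the rejected one."""
    margin = reward_chosen - reward_rejected
    # log1p(exp(-x)) == -log(sigmoid(x)), numerically stable for large margins
    return math.log1p(math.exp(-margin))

def filter_pairs(pairs):
    """Toy curation pass (hypothetical): drop exact duplicate pairs and pairs
    whose chosen and rejected responses are identical (uninformative ties)."""
    seen = set()
    kept = []
    for prompt, chosen, rejected in pairs:
        key = (prompt, chosen, rejected)
        if chosen == rejected or key in seen:
            continue
        seen.add(key)
        kept.append((prompt, chosen, rejected))
    return kept
```

A correctly trained reward model drives `bt_loss` down by widening the chosen-vs-rejected margin; real pipelines like the one described in the report apply far richer quality filters than the duplicate/tie check sketched here.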

Chris Yuhao Liu, Liang Zeng, Jiacai Liu, Rui Yan, Jujie He, Chaojie Wang, Shuicheng Yan, Yang Liu, Yahui Zhou • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Code Generation | HumanEval | -- | -- | 1036 |
| Language Understanding | MMLU | Accuracy | 41 | 825 |
| Reasoning | BBH | Accuracy | 30.2 | 672 |
| Instruction Following | IFEval | IFEval Accuracy | 70.06 | 625 |
| Multitask Language Understanding | MMLU | Accuracy | 72.33 | 413 |
| Question Answering | ARC Easy | Normalized Acc | 85.2 | 389 |
| Mathematical Reasoning | GSM8K | Accuracy (GSM8K) | 84.61 | 358 |
| Mathematical Reasoning | MathQA | Accuracy | 72.9 | 305 |
| Instruction Following | AlpacaEval | Win Rate | 77.4 | 227 |
| Common Sense Reasoning | HellaSwag | Accuracy | 76.42 | 213 |

Showing 10 of 86 rows.
