Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs
About
In this report, we introduce a collection of methods to enhance reward modeling for LLMs, focusing specifically on data-centric techniques. We propose effective data selection and filtering strategies for curating high-quality open-source preference datasets, culminating in the Skywork-Reward data collection, which contains only 80K preference pairs -- significantly smaller than existing datasets. Using this curated dataset, we develop the Skywork-Reward model series -- Skywork-Reward-Gemma-27B and Skywork-Reward-Llama-3.1-8B -- with the former currently holding the top position on the RewardBench leaderboard. Notably, our techniques and datasets have directly enhanced the performance of many top-ranked models on RewardBench, highlighting the practical impact of our contributions in real-world preference learning applications.
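Reward models of this kind are commonly trained with the pairwise Bradley-Terry objective on curated (chosen, rejected) preference pairs. The snippet below is a minimal illustrative sketch of that objective under that assumption; it is not the authors' training code, and the `bradley_terry_loss` helper and the toy reward values are hypothetical.

```python
# Minimal sketch of the pairwise Bradley-Terry objective commonly used for
# reward modeling on preference data (illustrative only, not the authors' code).
import torch
import torch.nn.functional as F


def bradley_terry_loss(chosen_rewards: torch.Tensor,
                       rejected_rewards: torch.Tensor) -> torch.Tensor:
    # Negative log-likelihood that the chosen response outranks the rejected one:
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


if __name__ == "__main__":
    # Hypothetical scalar rewards for a batch of four preference pairs.
    chosen = torch.tensor([1.2, 0.3, 2.0, -0.5])
    rejected = torch.tensor([0.4, 0.9, 1.5, -1.0])
    print(bradley_terry_loss(chosen, rejected))
```

In practice, the scalar rewards typically come from a sequence-classification head on top of the base LLM, and the loss decreases as the model learns to score the chosen response above the rejected one for each curated pair.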
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Code Generation | HumanEval | -- | -- | 850 |
| Language Understanding | MMLU | Accuracy | 41 | 756 |
| Reasoning | BBH | Accuracy | 30.2 | 507 |
| Question Answering | ARC Easy | Normalized Accuracy | 85.2 | 385 |
| Mathematical Reasoning | GSM8K | Accuracy | 84.61 | 358 |
| Instruction Following | IFEval | -- | -- | 292 |
| Multitask Language Understanding | MMLU | Accuracy | 72.33 | 206 |
| Common Sense Reasoning | HellaSwag | Accuracy | 76.42 | 164 |
| Reading Comprehension | RACE | Accuracy | 78.82 | 151 |
| Instruction Following | AlpacaEval | Win Rate | 77.4 | 125 |