
All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning

About

From a first-principles perspective, it may seem odd that the strongest results in foundation model fine-tuning (FT) are achieved via a relatively complex, two-stage training procedure. Specifically, one first trains a reward model (RM) on some dataset (e.g., human preferences) and then uses it to provide online feedback as part of a downstream reinforcement learning (RL) procedure, rather than directly optimizing the policy parameters on said dataset via offline maximum likelihood estimation. In fact, from an information-theoretic perspective, we can only lose information by passing through a reward model and cannot create any new information via on-policy sampling. To explain this discrepancy, we scrutinize several hypotheses on the value of RL in FT through both theoretical and empirical lenses. Of the hypotheses considered, we find the most support for the explanation that on problems with a generation-verification gap, (1) the RM (verifier) is comparatively simple and therefore easy to learn from the preference data, and (2) the downstream RL procedure only returns policies (generators) that are optimal for such simple verifiers. Thus, end-to-end, two-stage online FT only has to search over a reduced subset of the full space of policies, requiring less data than offline FT.

Gokul Swamy, Sanjiban Choudhury, Wen Sun, Zhiwei Steven Wu, J. Andrew Bagnell • 2025
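
To make the contrast between the two fine-tuning routes concrete, here is a minimal toy sketch in Python. It is not code from the paper: the tiny discrete response space, the Bradley-Terry annotator model, the REINFORCE-style policy gradient for the RL stage, and all function names and hyperparameters are illustrative assumptions. Route 1 fits a policy directly to the chosen responses by maximum likelihood; Route 2 first fits a (simpler) reward model on the same preference pairs and then optimizes a policy against it with on-policy sampling.

```python
"""Toy sketch: offline MLE vs. RM + RL fine-tuning on preference data.
Illustrative only; all modeling choices here are assumptions, not the paper's setup."""
import numpy as np

rng = np.random.default_rng(0)

N_RESPONSES = 5                                        # tiny discrete "response" space
true_reward = np.array([0.0, 0.5, 1.0, 1.5, 2.0])      # latent reward the annotators use


def sample_preferences(n_pairs=200):
    """Simulate pairwise preference labels from Bradley-Terry annotators."""
    pairs = []
    for _ in range(n_pairs):
        a, b = rng.choice(N_RESPONSES, size=2, replace=False)
        p_a_wins = 1.0 / (1.0 + np.exp(-(true_reward[a] - true_reward[b])))
        chosen, rejected = (a, b) if rng.random() < p_a_wins else (b, a)
        pairs.append((chosen, rejected))
    return pairs


def offline_mle(pairs):
    """Route 1: maximum-likelihood categorical policy over the chosen responses."""
    counts = np.zeros(N_RESPONSES)
    for chosen, _ in pairs:
        counts[chosen] += 1
    return counts / counts.sum()


def fit_reward_model(pairs, lr=0.1, steps=500):
    """Route 2a: fit a Bradley-Terry reward model (verifier) on the same pairs."""
    r_hat = np.zeros(N_RESPONSES)
    for _ in range(steps):
        grad = np.zeros(N_RESPONSES)
        for chosen, rejected in pairs:
            p = 1.0 / (1.0 + np.exp(-(r_hat[chosen] - r_hat[rejected])))
            grad[chosen] += 1.0 - p          # ascend the Bradley-Terry log-likelihood
            grad[rejected] -= 1.0 - p
        r_hat += lr * grad / len(pairs)
    return r_hat


def rl_against_rm(r_hat, lr=0.5, steps=2000):
    """Route 2b: on-policy RL (REINFORCE) against the learned reward model."""
    logits = np.zeros(N_RESPONSES)
    for _ in range(steps):
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        a = rng.choice(N_RESPONSES, p=probs)
        advantage = r_hat[a] - probs @ r_hat  # baseline = expected learned reward
        grad = -probs                         # d log pi(a) / d logits = onehot(a) - probs
        grad[a] += 1.0
        logits += lr * advantage * grad
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()


prefs = sample_preferences()
pi_offline = offline_mle(prefs)
pi_rl = rl_against_rm(fit_reward_model(prefs))
print("offline MLE policy:", np.round(pi_offline, 3))
print("RM + RL policy    :", np.round(pi_rl, 3))
print("expected true reward:",
      round(float(pi_offline @ true_reward), 3), "(offline) vs",
      round(float(pi_rl @ true_reward), 3), "(RM + RL)")
```

On a problem this small, both routes recover a reasonable policy; the paper's argument concerns how the two routes scale when the verifier is much simpler than the generator, which a toy of this size cannot demonstrate.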

Related benchmarks

| Task | Dataset | Result (Accuracy) | Rank |
| --- | --- | --- | --- |
| Multilingual Mathematical Reasoning | MT Math100 | 26.72 | 24 |
| Multilingual General Knowledge | Global MMLU Lite (subset of 18 languages) | 13.48 | 6 |
| Multilingual Mathematical Reasoning | MGSM (18 languages) | 33.66 | 6 |
| Multilingual Reading Comprehension | Belebele (18 languages) | 12.94 | 6 |
| Multilingual Reasoning and General Knowledge | Overall (18 languages) | 21.7 | 6 |
