
Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling

About

Existing post-training techniques for LLMs are broadly categorized into supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). Each paradigm presents a distinct trade-off: (1) SFT excels at mimicking demonstration data but, as a form of behavior cloning, can generalize poorly. (2) Conversely, RFT can significantly enhance a model's performance but is prone to learning unexpected behaviors, and its performance is sensitive to the initial policy. In this paper, we propose a unified view of these methods and introduce Prefix-RFT, a hybrid approach that synergizes learning from both demonstration and exploration. Using mathematical reasoning problems as a test bed, we empirically demonstrate that Prefix-RFT is simple yet effective. Not only does it surpass the performance of standalone SFT and RFT, but it also outperforms parallel mixed-policy RFT methods. Our analysis highlights the complementary nature of SFT and RFT, validating that Prefix-RFT effectively harmonizes them. Further ablation studies confirm the method's robustness to variations in the quality and quantity of demonstration data.

Zeyu Huang, Tianhao Cheng, Zihan Qiu, Zili Wang, Yinghui Xu, Edoardo M. Ponti, Ivan Titov • 2025

Related benchmarks

Task                    Dataset       Result                  Rank
Mathematical Reasoning  MATH 500      Accuracy (Acc): 81.4    543
Mathematical Reasoning  AIME 2024     Accuracy: 17.7          479
Mathematical Reasoning  AMC           Accuracy (%): 50.5      368
Mathematical Reasoning  AIME 2025     Accuracy: 17.7          311
Mathematical Reasoning  Minerva       Pass@1 Accuracy: 32.7   289
Mathematical Reasoning  Minerva Math  Accuracy: 18.1          233
Mathematical Reasoning  AIME 2024     Accuracy: 31.8          220
Mathematical Reasoning  AIME 2025     Accuracy: 26.4          214
General Reasoning       MMLU-Pro      Accuracy: 52.1          201
Mathematical Reasoning  Minerva       Accuracy (Acc): 40.3    146
Showing 10 of 36 rows
