Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling

About

Existing LLMs-post-training techniques are broadly categorized into supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). Each paradigm presents a distinct trade-off: (1) SFT excels at mimicking demonstration data, but can lead to problematic generalization as a form of behavior cloning. (2) Conversely, RFT can significantly enhance a model's performance but is prone to learning unexpected behaviors, and its performance is sensitive to the initial policy. In this paper, we propose a unified view of these methods and introduce Prefix-RFT, a hybrid approach that synergizes learning from both demonstration and exploration. Using mathematical reasoning problems as a test bed, we empirically demonstrate that Prefix-RFT is simple yet effective. Not only does it surpass the performance of standalone SFT and RFT, but it also outperforms parallel mixed-policy RFT methods. Our analysis highlights the complementary nature of SFT and RFT, validating that Prefix-RFT effectively harmonizes them. Further ablation studies confirm the method's robustness to variations in the quality and quantity of demonstration data.

Zeyu Huang, Tianhao Cheng, Zihan Qiu, Zili Wang, Yinghui Xu, Edoardo M. Ponti, Ivan Titov• 2025

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	MATH 500	Accuracy (Acc)81.4	543
Mathematical Reasoning	AIME 2024	Accuracy17.7	479
Mathematical Reasoning	AMC	Accuracy (%)50.5	368
Mathematical Reasoning	AIME 2025	Accuracy17.7	311
Mathematical Reasoning	Minerva	Pass@1 Accuracy32.7	289
Mathematical Reasoning	Minerva Math	Accuracy18.1	233
Mathematical Reasoning	AIME 2024	Accuracy31.8	220
Mathematical Reasoning	AIME 2025	Accuracy26.4	214
General Reasoning	MMLU-Pro	Accuracy52.1	201
Mathematical Reasoning	Minerva	Accuracy (Acc)40.3	146

Showing 10 of 36 rows

Other info

Follow for update

@wizwand_team Discord