# Proximal Supervised Fine-Tuning

## About
Supervised fine-tuning (SFT) of foundation models often generalizes poorly: prior capabilities deteriorate after tuning on new tasks or domains. Inspired by trust-region policy optimization (TRPO) and proximal policy optimization (PPO) in reinforcement learning (RL), we propose Proximal SFT (PSFT), a fine-tuning objective that brings the benefits of a trust region to SFT, constraining policy drift during fine-tuning while remaining competitive at the tuning task. By viewing SFT as a special case of policy gradient methods with constant positive advantages, we derive PSFT, which stabilizes optimization and improves generalization while leaving room for further optimization in subsequent post-training stages. Experiments in mathematical and human-value domains show that PSFT matches SFT in-domain, outperforms it on out-of-domain generalization, remains stable under prolonged training without entropy collapse, and provides a stronger foundation for subsequent post-training.
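The idea of SFT as policy gradient with a constant positive advantage can be sketched as a PPO-style clipped surrogate. The following is a minimal per-token illustration, not the authors' implementation: the function name, the fixed advantage of 1, and the clip range `eps` are illustrative assumptions.

```python
import math

def psft_token_loss(logp_new, logp_old, eps=0.2):
    """Per-token PSFT-style loss: a PPO clipped surrogate with
    constant advantage A = 1 (illustrative sketch).

    logp_new: log-prob of the target token under the current policy
    logp_old: log-prob under the frozen reference (pre-SFT) policy
    """
    # Importance ratio between current and reference policies.
    ratio = math.exp(logp_new - logp_old)
    # Clipping bounds the effective policy update (trust region).
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    # Maximize min(ratio * A, clipped * A) with A = 1,
    # i.e. minimize its negation.
    return -min(ratio, clipped)
```

When the ratio stays inside `[1 - eps, 1 + eps]`, the objective behaves like ordinary SFT on that token; once the current policy has already moved more than `eps` above the reference, the clipped term flattens the loss and the gradient vanishes, which is what constrains policy drift.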
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Instruction Following | IFEval | -- | -- | 625 |
| Mathematical Multimodal Reasoning | MathVerse | Accuracy | 44.14 | 221 |
| Mathematical Multimodal Reasoning | MathVista | Accuracy | 72.2 | 218 |
| Question Answering | TruthfulQA | Accuracy | 80.19 | 152 |
| Massive Multi-discipline Multimodal Understanding | MMMU | Accuracy | 43.33 | 152 |
| Mathematical Reasoning | AMC | Accuracy (%) | 44.84 | 134 |
| Mathematical Reasoning | Minerva | Pass@1 Accuracy | 32.26 | 90 |
| LLM Alignment Evaluation | AlpacaEval 2 | LC Win Rate | 23.29 | 86 |
| Mathematical Reasoning | OlympiadBench | Accuracy | 36.02 | 81 |
| Mathematical Reasoning | MATH 500 | -- | -- | 76 |