Soft Adaptive Policy Optimization

About

Reinforcement learning (RL) plays an increasingly important role in enhancing the reasoning capabilities of large language models (LLMs), yet stable and performant policy optimization remains challenging. Token-level importance ratios often exhibit high variance (a phenomenon exacerbated in Mixture-of-Experts models), leading to unstable updates. Existing group-based policy optimization methods, such as GSPO and GRPO, alleviate this problem via hard clipping, making it difficult to maintain both stability and effective learning. We propose Soft Adaptive Policy Optimization (SAPO), which replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO. When a sequence contains a few highly off-policy tokens, GSPO suppresses all gradients for that sequence, whereas SAPO selectively down-weights only the offending tokens and preserves the learning signal from the near-on-policy ones, improving sample efficiency. Relative to GRPO, SAPO replaces hard token-level clipping with smooth, temperature-controlled scaling, enabling more informative and stable updates. Empirical results on mathematical reasoning benchmarks indicate that SAPO exhibits improved training stability and higher Pass@1 performance under comparable training budgets. Moreover, we employ SAPO to train the Qwen3-VL model series, demonstrating that SAPO yields consistent performance gains across diverse tasks and different model sizes. Overall, SAPO provides a more reliable, scalable, and effective optimization strategy for RL training of LLMs.
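The contrast between hard clipping and a smooth gate can be illustrated with a small sketch. The abstract does not state the gate's formula, so the sigmoid-based weight below, the temperature `tau`, the clipping width `eps`, and all function names are assumptions chosen for illustration, not the authors' implementation. The sequence-level function is a simplified GSPO-style rule that clips on the geometric mean of the token ratios.

```python
import math

def hard_clip_weight(ratio, advantage, eps=0.2):
    """Gradient weight of the clipped surrogate min(r*A, clip(r, 1-eps, 1+eps)*A).

    The gradient w.r.t. r vanishes once the ratio leaves the band in the
    direction the advantage favors; otherwise it passes through unchanged.
    """
    if advantage > 0 and ratio > 1 + eps:
        return 0.0
    if advantage < 0 and ratio < 1 - eps:
        return 0.0
    return 1.0

def soft_gate_weight(ratio, tau=10.0):
    """Hypothetical smooth gate: 4*s*(1-s) with s = sigmoid(tau*(ratio-1)).

    Equals 1 at ratio == 1 and decays continuously as the ratio drifts away;
    a larger tau gives a narrower effective trust region.
    """
    s = 1.0 / (1.0 + math.exp(-tau * (ratio - 1.0)))
    return 4.0 * s * (1.0 - s)

def sequence_clip_weights(ratios, advantage, eps=0.2):
    """Simplified sequence-level clipping: one decision shared by all tokens."""
    seq_ratio = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
    w = hard_clip_weight(seq_ratio, advantage, eps)
    return [w] * len(ratios)

def soft_token_weights(ratios, tau=10.0):
    """Per-token soft gating: only far-off-policy tokens are attenuated."""
    return [soft_gate_weight(r, tau) for r in ratios]

# One sequence with three near-on-policy tokens and one highly off-policy token.
ratios = [1.0, 1.02, 0.98, 3.0]
print(sequence_clip_weights(ratios, advantage=1.0))
print([round(w, 3) for w in soft_token_weights(ratios)])
```

In this toy setting, the single outlier pushes the sequence-level ratio outside the band, so every token in the sequence loses its gradient, while the soft per-token gate keeps the three near-on-policy tokens at close to full weight and suppresses only the outlier.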

Chang Gao, Chujie Zheng, Xiong-Hui Chen, Kai Dang, Shixuan Liu, Bowen Yu, An Yang, Shuai Bai, Jingren Zhou, Junyang Lin • 2025

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	MATH	Accuracy 76.1	882
Mathematical Reasoning	AIME 2024	Accuracy 49	370
General Knowledge	MMLU	MMLU General Knowledge Accuracy 30.4	307
Interactive Decision-making	AlfWorld	Overall Success Rate 1.93	295
Mathematical Reasoning	MATH 500	--	236
Mathematical Reasoning	Minerva Math	Accuracy 33.8	233
Mathematical Reasoning	AIME 2024	Accuracy 53.96	220
Mathematical Reasoning	AMC	Accuracy (ACC) 64.7	215
Mathematical Reasoning	AIME 2025	Accuracy 34.17	214
Mathematical Reasoning	HMMT 2025	--	194

Showing 10 of 57 rows
