
Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR

About

Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have empowered large language models (LLMs) to tackle challenging reasoning tasks such as mathematics and programming. Despite its promise, the RLVR paradigm poses significant challenges, as existing methods often suffer from sparse reward signals and unstable policy gradient updates, inherent to RL-based approaches. To address these challenges, we propose $\textbf{PACS}$, a novel RLVR framework that achieves im$\textbf{P}$licit $\textbf{A}$ctor $\textbf{C}$ritic coupling via a $\textbf{S}$upervised learning framework. By treating the outcome reward as a predictable label, we reformulate the RLVR problem into a supervised learning task over a score function parameterized by the policy model and optimized using cross-entropy loss. A detailed gradient analysis shows that this supervised formulation inherently recovers the classical policy gradient update while providing more stable and efficient training. Extensive experiments demonstrate that PACS significantly outperforms strong open-source models and RLVR baselines, yielding substantial average gains of $\textbf{+8.26\%}$ (4B) and $\textbf{+9.57\%}$ (8B) over base models, offering a promising avenue for LLM post-training with verifiable rewards. Our code and data are available as open source at https://github.com/ritzz-ai/PACS.
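The core idea, "treat the outcome reward as a predictable label", can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it assumes a scalar score per response (the paper parameterizes this score via the policy model; the exact form is defined there) and a binary verifiable reward, and fits the sigmoid of the score to the reward with cross-entropy. A nice property of this formulation, consistent with the gradient analysis mentioned above, is that the gradient with respect to the score is simply `sigmoid(score) - reward`, a policy-gradient-like error signal. The function names here are hypothetical.

```python
import math

def pacs_style_loss(score: float, reward: float) -> float:
    """Cross-entropy between sigmoid(score) and a verifiable reward label.

    score:  scalar score for a sampled response; in PACS this would be
            parameterized by the policy model (assumption: any real number here).
    reward: verifiable outcome label, 1.0 if the answer checks out, else 0.0.
    """
    p = 1.0 / (1.0 + math.exp(-score))  # predicted probability of success
    eps = 1e-12  # numerical floor to keep log() finite
    return -(reward * math.log(p + eps) + (1.0 - reward) * math.log(1.0 - p + eps))

def pacs_style_grad(score: float, reward: float) -> float:
    """d(loss)/d(score) = sigmoid(score) - reward, the actor-critic-like error."""
    return 1.0 / (1.0 + math.exp(-score)) - reward
```

As a sanity check, raising the score of a verified-correct response lowers the loss, and the loss is symmetric between (score, reward=1) and (-score, reward=0), as expected of a binary cross-entropy objective.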

Jiaming Li, Longze Chen, Ze Gong, Yukun Chen, Lu Wang, Wanwei He, Run Luo, Min Yang • 2025

Related benchmarks

| Task | Dataset | Accuracy (%) | Rank |
|---|---|---|---|
| Mathematical Reasoning | AIME 2024 | 57.58 | 251 |
| Mathematical Reasoning | AIME 2025 | 46.38 | 227 |
| Mathematical Reasoning | AMC 23 | 90.45 | 198 |
| Mathematical Reasoning | MATH 500 | 95.09 | 106 |
| Mathematical Reasoning | Beyond AIME | 28.86 | 32 |
