
Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR

About

Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have empowered large language models (LLMs) to tackle challenging reasoning tasks such as mathematics and programming. Despite its promise, the RLVR paradigm poses significant challenges, as existing methods often suffer from sparse reward signals and unstable policy gradient updates, inherent to RL-based approaches. To address these challenges, we propose $\textbf{PACS}$, a novel RLVR framework that achieves im$\textbf{P}$licit $\textbf{A}$ctor $\textbf{C}$ritic coupling via a $\textbf{S}$upervised learning framework. By treating the outcome reward as a predictable label, we reformulate the RLVR problem into a supervised learning task over a score function parameterized by the policy model and optimized using cross-entropy loss. A detailed gradient analysis shows that this supervised formulation inherently recovers the classical policy gradient update while providing more stable and efficient training. Extensive experiments demonstrate that PACS significantly outperforms strong open-source models and RLVR baselines, yielding substantial average gains of $\textbf{+8.26\%}$ (4B) and $\textbf{+9.57\%}$ (8B) over base models, offering a promising avenue for LLM post-training with verifiable rewards. Our code and data are available as open source at https://github.com/ritzz-ai/PACS.
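The core idea, "treat the outcome reward as a predictable label", can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it assumes a scalar score per response (the paper parameterizes this score via the policy model; the exact form is defined there) and a binary verifiable reward, and fits the sigmoid of the score to the reward with cross-entropy. A nice property of this formulation, consistent with the gradient analysis mentioned above, is that the gradient with respect to the score is simply `sigmoid(score) - reward`, a policy-gradient-like error signal. The function names here are hypothetical.

```python
import math

def pacs_style_loss(score: float, reward: float) -> float:
    """Cross-entropy between sigmoid(score) and a verifiable reward label.

    score:  scalar score for a sampled response; in PACS this would be
            parameterized by the policy model (assumption: any real number here).
    reward: verifiable outcome label, 1.0 if the answer checks out, else 0.0.
    """
    p = 1.0 / (1.0 + math.exp(-score))  # predicted probability of success
    eps = 1e-12  # numerical floor to keep log() finite
    return -(reward * math.log(p + eps) + (1.0 - reward) * math.log(1.0 - p + eps))

def pacs_style_grad(score: float, reward: float) -> float:
    """d(loss)/d(score) = sigmoid(score) - reward, the actor-critic-like error."""
    return 1.0 / (1.0 + math.exp(-score)) - reward
```

As a sanity check, raising the score of a verified-correct response lowers the loss, and the loss is symmetric between (score, reward=1) and (-score, reward=0), as expected of a binary cross-entropy objective.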

Jiaming Li, Longze Chen, Ze Gong, Yukun Chen, Lu Wang, Wanwei He, Run Luo, Min Yang • 2025

Related benchmarks

| Task | Dataset | Accuracy (%) | Rank |
|---|---|---|---|
| Mathematical Reasoning | AIME 2024 | 57.58 | 251 |
| Mathematical Reasoning | AIME 2025 | 46.38 | 227 |
| Mathematical Reasoning | AMC 23 | 90.45 | 198 |
| Mathematical Reasoning | MATH 500 | 95.09 | 106 |
| Mathematical Reasoning | Beyond AIME | 28.86 | 32 |
