
What's Behind PPO's Collapse in Long-CoT? Value Optimization Holds the Secret

About

Reinforcement learning (RL) is pivotal for enabling large language models (LLMs) to generate long chains of thought (CoT) for complex tasks such as math and reasoning. However, Proximal Policy Optimization (PPO), though effective in many RL scenarios, fails on long-CoT tasks. This paper identifies value initialization bias and reward signal decay as the root causes of PPO's failure and proposes Value-Calibrated PPO (VC-PPO) to address them. In VC-PPO, the value model is pretrained to remove the initialization bias, and the Generalized Advantage Estimation (GAE) computation is decoupled between the actor and the critic to mitigate reward signal decay. Experiments on the American Invitational Mathematics Examination (AIME) show that VC-PPO significantly boosts PPO performance, and ablation studies confirm that both techniques are essential to making PPO work on long-CoT tasks.
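The GAE decoupling amounts to running the advantage recursion twice with different λ values: a λ near 1 for the critic's return targets, so the sparse end-of-sequence reward does not decay geometrically as it propagates back over thousands of tokens, and a smaller λ for the actor's advantages to keep their variance down. Below is a minimal NumPy sketch of this idea, assuming a single trajectory with per-token rewards (typically zero except at the final token) and a bootstrap value appended at the end; the function names and default λ values are illustrative, not the paper's implementation.

```python
import numpy as np

def gae(rewards, values, gamma, lam):
    """Generalized Advantage Estimation over one trajectory.

    rewards: per-token rewards, length T (often zero except at the last token)
    values:  value predictions, length T + 1 (bootstrap value appended at the end)
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        # TD residual at token t
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # exponentially weighted sum of future residuals, discounted by gamma * lam
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def decoupled_gae(rewards, values, gamma=1.0, lam_actor=0.95, lam_critic=1.0):
    """Decoupled GAE (illustrative sketch): separate lambdas for actor and critic.

    lam_critic = 1.0 makes the value targets unbiased Monte Carlo returns, so
    the terminal reward reaches every token undecayed; lam_actor < 1.0 trades
    a little bias for lower-variance policy advantages.
    """
    values = np.asarray(values, dtype=float)
    advantages = gae(rewards, values, gamma, lam_actor)                     # feeds the policy loss
    value_targets = gae(rewards, values, gamma, lam_critic) + values[:-1]   # feeds the value loss
    return advantages, value_targets
```

To see why the split matters: with γ = 1, λ_critic = 1, and a single terminal reward, the value target at every token telescopes to that terminal reward, so the critic's learning signal is independent of sequence length. Under a zero-initialized critic, by contrast, the actor's λ_actor = 0.95 advantage at token t is scaled by 0.95^(T−1−t), which vanishes for tokens thousands of positions before the end.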

Yufeng Yuan, Yu Yue, Ruofei Zhu, Tiantian Fan, Lin Yan • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
General Reasoning | MMLU-Pro | pass@1 Accuracy | 51.56 | 69
Mathematical Reasoning | AMC23 (val) | Accuracy | 70.78 | 24
Mathematical Reasoning | AIME 2024 (val) | Pass@1 Success Rate | 25.83 | 18
Mathematical Reasoning | MATH500 (val) | Accuracy | 87.54 | 17
General Reasoning | ARC-C | Pass@1 | 74.57 | 10
Geometry Reasoning | Geometry3K (test) | Test Score | 46.52 | 8
Mathematical Reasoning | Countdown-34 (held-out) | Accuracy | 78.15 | 8
Mathematical Reasoning | Countdown 4 | Accuracy | 54.54 | 8
Mathematical Reasoning | AIME 2025 (val) | Accuracy | 21.67 | 7
Mathematical Reasoning | OlympiadBench (val) | Accuracy | 53.5 | 4
(Showing 10 of 11 rows.)
