
From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks

About

Reinforcement learning has emerged as a paradigm for post-training large language models, boosting their reasoning capabilities. Such approaches compute an advantage value for each sample, reflecting better or worse performance than expected, thereby yielding both positive and negative signals for training. However, the indiscriminate mixing of the two signals in existing methods, especially from the early stages, may lead to ambiguous guidance and limited gains. To address this issue, we propose **CAPO** (**C**urriculum **A**dvantage **P**olicy **O**ptimization), an adaptive curriculum mechanism based on advantage signals. The proposed mechanism bootstraps imitation learning with positive-only advantage samples to establish robust foundations, and subsequently introduces negative signals to cultivate discriminative capabilities, thereby improving generalization across complex scenarios. Compatible with diverse optimization methods including GRPO, PPO, RLOO, and Reinforce++, our method consistently achieves stable and significant improvements in mathematical reasoning tasks, and further generalizes effectively to multimodal Graphical User Interface (GUI) reasoning scenarios, establishing itself as a versatile and robust optimization framework.
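The two-phase idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract describes the mechanism as adaptive but does not state the switching rule, so the fixed `switch_step` threshold, the function names, and the GRPO-style group normalization used here are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style group-relative advantage: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def curriculum_advantages(rewards, step, switch_step):
    """Hypothetical two-phase curriculum over advantage signals.

    Before `switch_step` (an assumed hard threshold), negative advantages are
    zeroed so that only better-than-expected samples contribute (imitation
    phase). Afterwards, both signs are kept (discrimination phase).
    """
    adv = group_advantages(rewards)
    if step < switch_step:
        return [a if a > 0 else 0.0 for a in adv]
    return adv


# One group of four sampled responses with binary correctness rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
early = curriculum_advantages(rewards, step=10, switch_step=100)
late = curriculum_advantages(rewards, step=200, switch_step=100)
print("early:", early)
print("late: ", late)
```

In the early phase the incorrect responses contribute no gradient signal, while in the late phase they receive negative advantages and are actively pushed down. Because the curriculum only changes which advantages are kept, the same masking could in principle sit on top of other advantage estimators such as those used by PPO, RLOO, or Reinforce++.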

Changpeng Yang, Jinyang Wu, Yuchen Liu, Shuai Zhang, Yang Li, Qiliang Liang, Hongzhen Wang, Shuai Nie, Jiaming Xu, Runyu Shi, Ying Huang, Guoquan Zhang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Mathematical Reasoning | AIME 2024 | Accuracy | 33.3 | 251 |
| Mathematical Reasoning | CollegeMATH | Accuracy | 46.3 | 161 |
| Mathematical Reasoning | AMC | Accuracy | 67.5 | 151 |
| Mathematical Reasoning | Olympiad | Accuracy | 39.7 | 50 |
| Mathematical Reasoning | MATH500 | Accuracy | 76.8 | 45 |
| Mathematical Reasoning | Minerva | Accuracy (@avg1) | 40.1 | 33 |
| GUI Grounding | ScreenSpot-Pro Scientific | Text Accuracy | 56.94 | 5 |
| GUI Grounding | ScreenSpot-Pro Creative | Text Accuracy | 37.37 | 5 |
| GUI Grounding | ScreenSpot-Pro CAD | Text Accuracy | 32.99 | 5 |
| GUI Grounding | ScreenSpot-Pro (dev) | Text Accuracy | 29.22 | 5 |
Showing 10 of 15 rows
