DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization
About
The recent success and openness of DeepSeek-R1 have brought widespread attention to Group Relative Policy Optimization (GRPO) as a reinforcement learning method for large reasoning models (LRMs). In this work, we analyze the GRPO objective under a binary reward setting and reveal an inherent limitation of question-level difficulty bias. We also identify a connection between GRPO and traditional discriminative methods in supervised learning. Motivated by these insights, we introduce a new Discriminative Constrained Optimization (DisCO) framework for reinforcing LRMs, grounded in the principle of discriminative learning. The main differences between DisCO and GRPO and its recent variants are: (1) it replaces the group relative objective with a discriminative objective defined by a scoring function; (2) it abandons clipping-based surrogates in favor of non-clipping RL surrogate objectives used as scoring functions; (3) it employs a simple yet effective constrained optimization approach to enforce the KL divergence constraint. As a result, DisCO offers notable advantages over GRPO and its variants: (i) it completely eliminates difficulty bias by adopting discriminative objectives; (ii) it addresses the entropy instability in GRPO and its variants through the use of non-clipping scoring functions and a constrained optimization approach, yielding long and stable training dynamics; (iii) it allows the incorporation of advanced discriminative learning techniques to address data imbalance, where a significant number of questions have more negative than positive generated answers during training. Our experiments on enhancing the mathematical reasoning capabilities of SFT-finetuned models show that DisCO significantly outperforms GRPO and its improved variants such as DAPO, achieving average gains of 7% over GRPO and 6% over DAPO across six benchmark tasks for a 1.5B model.
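To make the difficulty-bias claim concrete, the following is a minimal sketch, not the authors' code. It assumes the commonly used GRPO advantage, in which each reward is normalized by the mean and standard deviation of its group; the function names `grpo_advantages` and `question_weight` are hypothetical. With binary rewards and a fraction p of correct answers, the advantage mass placed on correct answers, averaged over the group, works out to sqrt(p(1 - p)), so very easy and very hard questions contribute little signal.

```python
import math


def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - mean) / (std + eps).

    Assumption: standard mean/std normalization within one question's
    group of sampled answers; the paper's exact objective may differ.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def question_weight(rewards):
    """sqrt(p * (1 - p)), where p is the fraction of correct answers."""
    p = sum(rewards) / len(rewards)
    return math.sqrt(p * (1 - p))


if __name__ == "__main__":
    groups = {
        "hard (1 of 8 correct)": [1, 0, 0, 0, 0, 0, 0, 0],
        "balanced (4 of 8 correct)": [1, 1, 1, 1, 0, 0, 0, 0],
        "all wrong (0 of 8 correct)": [0] * 8,
    }
    for name, rewards in groups.items():
        adv = grpo_advantages(rewards)
        # Average advantage mass assigned to the correct answers.
        positive_mass = sum(a for a in adv if a > 0) / len(adv)
        print(name, round(positive_mass, 4), round(question_weight(rewards), 4))
```

In this sketch a balanced question receives the largest weight (0.5), while a question with no correct samples receives none. This is one way to read the question-level bias that the abstract says a discriminative objective removes.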
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | AIME 2024 | Accuracy | 11.4 | 479 |
| Mathematical Reasoning | MATH 500 | Top-1 Accuracy | 83 | 384 |
| Mathematical Reasoning | AMC | Accuracy (%) | 52.9 | 368 |
| Mathematical Reasoning | OlympiadBench | Accuracy | 18.7 | 213 |
| Mathematical Reasoning | HMMT 2025 | -- | -- | 194 |
| Mathematical Reasoning | HMMT25 | Accuracy (%) | 5.3 | 115 |
| Mathematical Reasoning | AMC | Average Pass@32 | 82 | 44 |
| Mathematical Reasoning | AIME 26 | Accuracy | 9.4 | 41 |
| Mathematical Reasoning | AIME 2026 | Average Success Rate (avg@32) | 45.1 | 29 |
| Mathematical Reasoning | AIME25 | Accuracy | 12.2 | 6 |