
Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective

About

Group Relative Policy Optimization (GRPO) is one of the most widely adopted RLVR algorithms for post-training large language models on reasoning tasks. We first show that GRPO admits an equivalent discriminative reformulation, in which policy optimization maximizes the expected score gap between verified positive and negative rollouts. This reformulation reveals two objective-level limitations: likelihood-misaligned surrogate scores, in which clipped ratio-based scores are optimized rather than the sequence likelihoods that govern generation, and score-insensitive credit assignment, in which rollout-level credit does not reflect the current score gaps between positive and negative rollouts. To address these limitations, we propose ConSPO, a Contrastive Sequence-level Policy Optimization method that uses length-normalized sequence log-probabilities as rollout scores and contrasts verified positive rollouts against negative distractors within the same group. ConSPO optimizes a group-wise InfoNCE-style objective to adaptively strengthen updates for poorly separated positives and high-scoring negatives, together with a curriculum-scheduled margin that preserves separation pressure as training progresses. Experiments across diverse settings show that ConSPO outperforms strong baselines on challenging reasoning benchmarks. Code will be released upon paper acceptance.
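The group-wise contrastive idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact loss, so the function names, the `temperature` parameter, the placement of the margin on the positive score, and the choice to return zero for groups without both positive and negative rollouts are all assumptions made for illustration.

```python
import math


def length_normalized_score(token_logprobs):
    """Mean per-token log-probability of one rollout (its sequence score)."""
    return sum(token_logprobs) / len(token_logprobs)


def groupwise_infonce_loss(scores, is_correct, margin=0.0, temperature=1.0):
    """InfoNCE-style loss over one rollout group (illustrative assumption).

    Each verified positive is contrasted against all negatives in the group.
    `margin` is subtracted from the positive score, so a larger margin keeps
    pressure on the positive even when it is already ahead of the negatives.
    """
    pos = [s for s, ok in zip(scores, is_correct) if ok]
    neg = [s for s, ok in zip(scores, is_correct) if not ok]
    if not pos or not neg:
        # Assumption: a group with no contrast contributes no loss.
        return 0.0

    neg_logits = [s / temperature for s in neg]
    total = 0.0
    for s in pos:
        pos_logit = (s - margin) / temperature
        logits = [pos_logit] + neg_logits
        # Numerically stable log-sum-exp over the positive and all negatives.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log of the softmax probability assigned to the positive.
        total += log_z - pos_logit
    return total / len(pos)
```

In this sketch, a positive that scores no better than a single negative incurs a loss of log 2 (about 0.693) at zero margin, and the loss grows as negatives score higher or the margin increases, which mirrors the adaptive weighting the abstract describes for poorly separated positives and high-scoring negatives.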

Feng Zhang, Xinhong Ma, Ziqiang Dong, Xi Leng, Jianfei Zhao, Xin Sun, Yang Yang, Guanjun Jiang • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Mathematical Reasoning | AIME 2024 | Accuracy: 12.3 | 479 |
| Mathematical Reasoning | MATH 500 | Top-1 Accuracy: 82.2 | 384 |
| Mathematical Reasoning | AMC | Accuracy (%): 53.8 | 368 |
| Mathematical Reasoning | OlympiadBench | Accuracy: 20.4 | 213 |
| Mathematical Reasoning | HMMT 2025 | -- | 194 |
| Mathematical Reasoning | HMMT25 | Accuracy (%): 8 | 115 |
| Mathematical Reasoning | AMC | Average Pass@32: 83.8 | 44 |
| Mathematical Reasoning | AIME 26 | Accuracy: 12.8 | 41 |
| Mathematical Reasoning | AIME 2026 | Average Success Rate (avg@32): 46.8 | 29 |
| Mathematical Reasoning | AIME25 | Accuracy: 15.6 | 6 |
Showing 10 of 11 rows
