
The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning

About

Reinforcement learning with verifiable rewards (RLVR) is a promising approach for training language models (LMs) on reasoning tasks that elicit emergent long chains of thought (CoTs). Unlike supervised learning, it updates the model using both correct and incorrect samples via policy gradients. To better understand its mechanism, we decompose the learning signal into reinforcing correct responses and penalizing incorrect ones, referred to as Positive and Negative Sample Reinforcement (PSR and NSR), respectively. We train Qwen2.5-Math-7B, Qwen3-4B, and Llama-3.1-8B-Instruct on a mathematical reasoning dataset and uncover a surprising result: training with only negative samples -- without reinforcing correct responses -- can be highly effective. It consistently improves performance over the base model across the entire Pass@$k$ spectrum ($k$ up to 256), often matching or surpassing PPO and GRPO. In contrast, reinforcing only correct responses improves Pass@1 but degrades performance at higher $k$ due to reduced diversity. These inference-scaling trends highlight that solely penalizing incorrect responses may contribute more to performance than previously recognized. Through gradient analysis, we show that NSR works by suppressing incorrect generations and redistributing probability mass toward other plausible candidates, guided by the model's prior beliefs. It refines the model's existing knowledge rather than introducing entirely new behaviors. Building on this insight, we propose a simple variant of the RL objective that upweights NSR, and show that it consistently improves overall Pass@$k$ performance on MATH, AIME 2025, and AMC23. Our code is available at https://github.com/TianHongZXY/RLVR-Decomposed.
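The PSR/NSR decomposition described above can be sketched as a REINFORCE-style loss over sampled responses. This is a minimal illustration rather than the paper's exact implementation; the weighting knobs `lam_psr` and `lam_nsr` are hypothetical names, with `lam_nsr > lam_psr` mirroring the spirit of the proposed NSR-upweighted variant:

```python
def decomposed_pg_loss(logprobs, rewards, lam_psr=1.0, lam_nsr=1.0):
    """REINFORCE-style loss decomposed into PSR and NSR terms.

    logprobs: per-response sums of token log-probabilities under the policy
    rewards:  verifiable rewards in {0, 1} (1 = answer verified correct)
    """
    # PSR: push up the likelihood of correct responses
    psr = -sum(lp for lp, r in zip(logprobs, rewards) if r == 1)
    # NSR: push down the likelihood of incorrect responses
    nsr = sum(lp for lp, r in zip(logprobs, rewards) if r == 0)
    return (lam_psr * psr + lam_nsr * nsr) / max(len(logprobs), 1)
```

Setting `lam_psr=0` recovers NSR-only training: gradients flow only through incorrect samples, lowering their probability and letting the softmax redistribute mass to the model's other candidate responses.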

Xinyu Zhu, Mengzhou Xia, Zhepei Wei, Wei-Lin Chen, Danqi Chen, Yu Meng • 2025

Related benchmarks

Task                    Dataset                     Metric    Result  Rank
Mathematical Reasoning  Minerva                     Pass@1    44.93   138
Mathematical Reasoning  AIME 2024 (test)            Accuracy  61.6    103
Mathematical Reasoning  AIME 24                     Pass@1    32.1    59
Mathematical Reasoning  AMC 23                      Pass@1    70.1    46
Mathematical Reasoning  Math Benchmarks Aggregate   Pass@1    48.58   44
Mathematical Reasoning  MATH500 1.0 (test)          Accuracy  95.7    34
Mathematical Reasoning  MATH 500                    Pass@1    81.48   25
Mathematical Reasoning  AIME 25                     Pass@1    14.31   24
Mathematical Reasoning  AIME24                      Pass@1    32.8    18
Mathematical Reasoning  AMC23                       Pass@1    72.7    18

Showing 10 of 16 rows
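For reference, Pass@k values such as those reported above are conventionally computed with the standard unbiased estimator over n sampled responses of which c are correct. This is a generic sketch of that estimator, not code taken from this paper's evaluation pipeline:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: probability that at least one of k responses,
    drawn without replacement from n samples (c of them correct), is correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Pass@1 is simply the fraction of correct samples, c / n; larger k reveals whether diversity across samples is preserved, which is the axis on which PSR-only and NSR-only training diverge.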
