
Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

About

Recent advances in reinforcement learning (RL) using numerical rewards have significantly enhanced the complex reasoning capabilities of large language models (LLMs). However, we identify three fundamental limitations of purely numerical feedback: performance plateaus, ineffective spontaneous self-reflection, and persistent failures. We show that plateaued RL models can successfully refine failed solutions when given natural language critiques. Motivated by this, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for policy optimization. This approach enables LLMs to learn simultaneously from initial responses and critique-guided refinements, effectively internalizing the exploration benefits of both stages. Extensive experiments show that Critique-GRPO outperforms all compared supervised and RL-based fine-tuning methods, achieving average Pass@1 improvements of approximately +15.0% to +21.6% on various Qwen models and +7.3% on Llama-3.2-3B-Instruct across eight challenging reasoning tasks. Notably, Critique-GRPO facilitates effective self-improvement through self-critiquing, achieving substantial gains over GRPO, e.g., a +16.7% Pass@1 improvement on AIME 2024.
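The abstract says the policy learns from initial responses and critique-guided refinements together, but it does not spell out the objective. The sketch below only illustrates why pooling helps under GRPO-style group normalization. It assumes binary correctness rewards and assumes refinements are simply added to the group before rewards are normalized; `group_advantages` and `mixed_group_advantages` are hypothetical helpers, not the authors' code, and the paper's actual objective (for example, any weighting or shaping applied to refinements) may differ.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward by the group mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std of the group's rewards
    return [(r - mu) / (sigma + eps) for r in rewards]


def mixed_group_advantages(initial_rewards, refined_rewards, eps=1e-6):
    """Hypothetical helper (assumption): pool initial responses and
    critique-guided refinements into one group, normalize jointly, then
    split the advantages back into the two subsets."""
    pooled = list(initial_rewards) + list(refined_rewards)
    adv = group_advantages(pooled, eps)
    n = len(initial_rewards)
    return adv[:n], adv[n:]


if __name__ == "__main__":
    # A "persistent failure" prompt: every initial attempt is wrong (reward 0).
    print(group_advantages([0, 0, 0, 0]))  # all zeros: no learning signal
    # One of two critique-guided refinements succeeds (reward 1).
    init_adv, ref_adv = mixed_group_advantages([0, 0, 0, 0], [1, 0])
    print(init_adv, ref_adv)
```

With four failed initial attempts, plain group normalization gives every response an advantage of zero, so that prompt contributes no gradient. Pooling in one successful refinement gives it an advantage of about 2.24 and each failure about -0.45, which is one way numerical rewards and critique-driven exploration could complement each other.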

Xiaoying Zhang, Yipeng Zhang, Hao Sun, Kaituo Feng, Chaochao Lu, Chao Yang, Helen Meng • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Mathematical Reasoning | MATH 500 | Pass@1 | 93.45 | 153
Mathematical Reasoning | Minerva | Pass@1 | 50.29 | 138
Mathematical Reasoning | Olympiad Bench | Pass@1 accuracy | 66.8 | 115
Mathematical Reasoning | AMC | Pass@1 | 92.8 | 112
Mathematical Reasoning | AIME 2025 | Pass@1 | 39.6 | 96
Mathematical Reasoning | AIME 2024 | Pass@1 | 55.65 | 86
Mathematical Reasoning | Minerva Math | Pass@1 accuracy | 52.9 | 82
General Reasoning | MMLU-Pro | Accuracy | 78.49 | 48
Scientific Reasoning | GPQA Diamond | -- | -- | 45
Mathematical Reasoning | AMC23 | Pass@1 | 92.5 | 43

Showing 10 of 24 rows.
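Most results above are reported as Pass@1. The page does not say how many samples per problem were drawn, so the following is only a sketch of the commonly used unbiased pass@k estimator, assuming n generations per problem of which c are correct. For k = 1 it reduces to c / n, so benchmark Pass@1 is the per-problem fraction of correct samples averaged over problems. `pass_at_k` and `benchmark_pass_at_k` are illustrative names, not part of any evaluation harness the authors used.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples, drawn
    without replacement from n generations with c correct, is correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k draws include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k=1):
    """Average per-problem pass@k; `results` is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Three hypothetical problems, four samples each: 1, 4 and 0 correct.
    print(benchmark_pass_at_k([(4, 1), (4, 4), (4, 0)], k=1))
```

For k = 1 the estimator equals plain sample accuracy; larger k only matters when reporting metrics such as Pass@8.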
