Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

About

Recent advances in reinforcement learning (RL) using numerical rewards have significantly enhanced the complex reasoning capabilities of large language models (LLMs). However, we identify three fundamental limitations of purely numerical feedback: performance plateaus, ineffective spontaneous self-reflection, and persistent failures. We show that plateaued RL models can successfully refine failed solutions when given natural language critiques. Motivated by this, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for policy optimization. This approach enables LLMs to learn simultaneously from initial responses and critique-guided refinements, effectively internalizing the exploration benefits of both stages. Extensive experiments show that Critique-GRPO outperforms all compared supervised and RL-based fine-tuning methods, achieving average Pass@1 improvements of approximately +15.0-21.6% on various Qwen models and +7.3% on Llama-3.2-3B-Instruct across eight challenging reasoning tasks. Notably, Critique-GRPO facilitates effective self-improvement through self-critiquing, achieving substantial gains over GRPO, e.g., a +16.7% Pass@1 improvement on AIME 2024. The code and models are released at: https://github.com/zhangxy-2019/critique-GRPO

Xiaoying Zhang, Yipeng Zhang, Hao Sun, Kaituo Feng, Chaochao Lu, Chao Yang, Helen Meng • 2025
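
To make the idea of learning "simultaneously from initial responses and critique-guided refinements" more concrete, the following is a minimal sketch of a standard GRPO-style group-relative advantage computed over one combined group. It is an illustration only, not the authors' implementation: the reward values, the function name `group_relative_advantages`, and the choice to score both kinds of samples in a single group are assumptions, and the paper's actual objective may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical binary correctness rewards for a single prompt.
initial_rewards = [0.0, 0.0, 1.0, 0.0]  # sampled initial responses
refined_rewards = [1.0, 1.0]            # critique-guided refinements of failed responses

# Assumption: initial responses and refinements are normalized together.
combined = initial_rewards + refined_rewards
advantages = group_relative_advantages(combined)

for reward, advantage in zip(combined, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

In this sketch, correct samples (including successful refinements) receive positive advantages and failed samples receive negative ones, which is the numerical signal a policy-gradient update would then use.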

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	MATH	Accuracy 63.2	882
Instruction Following	IFEval	IFEval Accuracy 85.58	854
Instruction Following	AlpacaEval 2.0	Win Rate 68.2	752
Mathematical Reasoning	AIME 2024	Accuracy 28.2	394
Mathematical Reasoning	AIME 2025	Accuracy 13.2	378
General Knowledge	MMLU	MMLU General Knowledge Accuracy 22.8	373
Mathematical Reasoning	Minerva Math	Accuracy 59.6	251
Mathematical Reasoning	MATH 500	pass@1 93.45	239
Mathematical Reasoning	AMC	Accuracy (ACC) 54.2	224
General Reasoning	MMLU-Pro	Accuracy 78.49	213

Showing 10 of 68 rows
