Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

About

Recent advances in reinforcement learning (RL) using numerical rewards have significantly enhanced the complex reasoning capabilities of large language models (LLMs). However, we identify three fundamental limitations of purely numerical feedback: performance plateaus, ineffective spontaneous self-reflection, and persistent failures. We show that plateaued RL models can successfully refine failed solutions when given natural language critiques. Motivated by this, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for policy optimization. This approach enables LLMs to learn simultaneously from initial responses and critique-guided refinements, effectively internalizing the exploration benefits of both stages. Extensive experiments show that Critique-GRPO outperforms all compared supervised and RL-based fine-tuning methods, achieving average Pass@1 improvements of approximately +15.0-21.6% on various Qwen models and +7.3% on Llama-3.2-3B-Instruct across eight challenging reasoning tasks. Notably, Critique-GRPO facilitates effective self-improvement through self-critiquing, achieving substantial gains over GRPO, e.g., +16.7% Pass@1 improvement on AIME 2024.
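As a rough illustration of how numerical feedback might be shared across the two stages, the sketch below standardizes rewards within one sampled group, in the style of GRPO's group-relative advantage. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify how initial responses and critique-guided refinements are pooled, so placing both in one group is an assumption, and the function name, reward values, and `eps` are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampled group (GRPO-style).

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical binary correctness rewards for a single prompt.
initial_rewards = [0.0, 0.0, 1.0, 0.0]  # sampled initial responses
refined_rewards = [1.0, 1.0]            # critique-guided refinements

# Assumption: both stages are scored the same way and pooled into one group.
advantages = group_relative_advantages(initial_rewards + refined_rewards)

for reward, advantage in zip(initial_rewards + refined_rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

In this toy group, correct responses receive positive advantages and failed ones negative advantages, regardless of which stage produced them. If every response in a group receives the same reward, all advantages are zero and the group contributes no learning signal.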

Xiaoying Zhang, Yipeng Zhang, Hao Sun, Kaituo Feng, Chaochao Lu, Chao Yang, Helen Meng • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH | Accuracy | 63.2 | 882 |
| Instruction Following | IFEval | IFEval Accuracy | 85.58 | 625 |
| Instruction Following | AlpacaEval 2.0 | Win Rate | 68.2 | 507 |
| Mathematical Reasoning | MATH 500 | pass@1 | 93.45 | 239 |
| General Knowledge | MMLU | MMLU General Knowledge Accuracy | 22.8 | 234 |
| Mathematical Reasoning | AMC | Accuracy (ACC) | 54.2 | 203 |
| Mathematical Reasoning | Minerva Math | Accuracy | 59.6 | 186 |
| Mathematical Reasoning | AIME 2024 | Accuracy | 28.2 | 151 |
| Mathematical Reasoning | Minerva | Pass@1 | 50.29 | 138 |
| Mathematical Reasoning | Olympiad Bench | Pass@1 Accuracy | 66.8 | 115 |
Showing 10 of 40 rows
