
Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

About

Recent advances in reinforcement learning (RL) using numerical rewards have significantly enhanced the complex reasoning capabilities of large language models (LLMs). However, we identify three fundamental limitations of purely numerical feedback: performance plateaus, ineffective spontaneous self-reflection, and persistent failures. We show that plateaued RL models can successfully refine failed solutions when given natural language critiques. Motivated by this, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for policy optimization. This approach enables LLMs to learn simultaneously from initial responses and critique-guided refinements, effectively internalizing the exploration benefits of both stages. Extensive experiments show that Critique-GRPO outperforms all compared supervised and RL-based fine-tuning methods, achieving average Pass@1 improvements of approximately +15.0-21.6% on various Qwen models and +7.3% on Llama-3.2-3B-Instruct across eight challenging reasoning tasks. Notably, Critique-GRPO facilitates effective self-improvement through self-critiquing, achieving substantial gains over GRPO, e.g., +16.7% Pass@1 improvement on AIME 2024.
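As a rough illustration of learning from initial responses and critique-guided refinements together, the sketch below applies the standard GRPO group-relative advantage (each reward normalized by its group's mean and standard deviation) to one pooled group. Pooling both stages into a single group is an assumption made here for illustration; the abstract does not specify the exact objective, and the reward values and the `group_relative_advantages` helper are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: each reward normalized within its group.

    Uses the population standard deviation; eps avoids division by zero
    when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical numerical rewards for one prompt: 1.0 = correct, 0.0 = incorrect.
initial_rewards = [0.0, 0.0, 1.0, 0.0]   # initial sampled responses
refinement_rewards = [1.0, 1.0]          # refinements produced after a critique

# Assumption (not stated in the abstract): both stages share one group.
pooled_rewards = initial_rewards + refinement_rewards
advantages = group_relative_advantages(pooled_rewards)

for reward, advantage in zip(pooled_rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

In this toy group, correct samples (including the successful refinements) receive positive advantages and failed initial responses receive negative ones, so the refinements contribute learning signal on a prompt where most initial attempts failed.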

Xiaoying Zhang, Yipeng Zhang, Hao Sun, Kaituo Feng, Chaochao Lu, Chao Yang, Helen Meng • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Mathematical Reasoning | MATH | Accuracy | 63.2 | 882 |
| Instruction Following | IFEval | IFEval Accuracy | 85.58 | 836 |
| Instruction Following | AlpacaEval 2.0 | Win Rate | 68.2 | 722 |
| General Knowledge | MMLU | MMLU General Knowledge Accuracy | 22.8 | 307 |
| Mathematical Reasoning | MATH 500 | pass@1 | 93.45 | 239 |
| Mathematical Reasoning | Minerva Math | Accuracy | 59.6 | 233 |
| Mathematical Reasoning | AIME 2024 | Accuracy | 28.2 | 220 |
| Mathematical Reasoning | AMC | Accuracy (ACC) | 54.2 | 215 |
| Mathematical Reasoning | AIME 2025 | Accuracy | 13.2 | 214 |
| General Reasoning | MMLU-Pro | Accuracy | 78.49 | 201 |
Showing 10 of 52 rows
