
$\lambda$-GRPO: Unifying the GRPO Frameworks with Learnable Token Preferences

About

Reinforcement Learning from Human Feedback (RLHF) has been the dominant approach for improving the reasoning capabilities of Large Language Models (LLMs). Recently, Reinforcement Learning with Verifiable Rewards (RLVR) has simplified this paradigm by replacing the reward and value models with rule-based verifiers. A prominent example is Group Relative Policy Optimization (GRPO). However, GRPO inherently suffers from a length bias, since the same advantage is uniformly assigned to all tokens of a response. As a result, longer responses distribute the reward over more tokens and thus contribute disproportionately to gradient updates. Several variants, such as DAPO and Dr. GRPO, modify the token-level aggregation of the loss, yet these methods remain heuristic and offer limited interpretability regarding their implicit token preferences. In this work, we explore the possibility of allowing the model to learn its own token preferences during optimization. We unify existing frameworks under a single formulation and introduce a learnable parameter $\lambda$ that adaptively controls token-level weighting. We denote our method $\lambda$-GRPO and find that it achieves consistent improvements over vanilla GRPO and DAPO on multiple mathematical reasoning benchmarks. On Qwen2.5 models with 1.5B, 3B, and 7B parameters, $\lambda$-GRPO improves average accuracy over GRPO by $+1.9\%$, $+1.0\%$, and $+1.7\%$, respectively. Importantly, these gains come without any modifications to the training data or additional computational cost, highlighting the effectiveness and practicality of learning token preferences.
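To make the length-bias issue concrete, the sketch below shows how a single exponent can interpolate between token-aggregation schemes. This is a hypothetical illustration, not the paper's exact parameterization: the weight `1 / |o_i|**lam` recovers a per-response mean (as in vanilla GRPO) at `lam = 1` and a length-agnostic per-token sum (in the spirit of Dr. GRPO / DAPO's global aggregation) at `lam = 0`. The function names and the specific weighting form are illustrative assumptions.

```python
import numpy as np

def token_weights(lengths, lam):
    """Per-token weights for responses of the given token lengths.

    Hypothetical weighting w_{i,t} = 1 / |o_i|**lam:
      lam = 1.0 -> tokens of response i weighted 1/|o_i|
                   (per-response mean, as in vanilla GRPO)
      lam = 0.0 -> every token weighted equally
                   (length-agnostic, Dr. GRPO / DAPO style)
    The actual lambda-GRPO parameterization may differ.
    """
    return [np.full(L, 1.0 / L**lam) for L in lengths]

def weighted_objective(advantages, logps, lengths, lam):
    """sum_i sum_t w_{i,t} * A_i * logp_{i,t} under the weighting above.

    advantages: one scalar advantage A_i per response (GRPO assigns the
                same advantage to all tokens of a response).
    logps:      per-token log-probabilities for each response.
    """
    total = 0.0
    for A, lp, w in zip(advantages, logps, token_weights(lengths, lam)):
        total += A * float(np.sum(w * lp))
    return total

# With lam = 1, a 10-token and a 2-token response with identical
# per-token log-probs contribute equally; with lam = 0, the longer
# response dominates the update -- the length bias described above.
long_lp, short_lp = np.full(10, -0.5), np.full(2, -0.5)
balanced = weighted_objective([1.0, 1.0], [long_lp, short_lp], [10, 2], 1.0)
biased = weighted_objective([1.0, 1.0], [long_lp, short_lp], [10, 2], 0.0)
```

In $\lambda$-GRPO, rather than fixing this exponent by hand, the weighting is governed by a learnable parameter that the model adapts during optimization.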

Yining Wang, Jinman Zhao, Chuangxin Zhao, Shuhao Guan, Gerald Penn, Shinan Liu • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Mathematical Reasoning | Math: MATH500, AIME24, Minerva-Math, AMC23 | MATH500 Score 85 | 18 |
| Scientific Reasoning | Science Domain (In-Domain): SampleQA, GPQA (All), HLE | SampleQA Score 2.77 | 18 |
| Mathematical Problem Solving | Math Domain (Out-of-Domain): MATH500, AIME24, Minerva-Math, AMC23 | MATH500 Score 89.6 | 11 |
| Mathematical Reasoning | Math Domain (In-Domain) | MATH500 Score 90 | 11 |
| Science and Question Answering | Science & QA: SampleQA, GPQA, HLE | SampleQA Score 1.62 | 11 |
| Scientific Question Answering | Science & QA Domain (Out-of-Domain) | SampleQA Score 2.91 | 11 |
