Do Not Let Low-Probability Tokens Over-Dominate in RL for LLMs

About

Reinforcement learning (RL) has become a cornerstone for enhancing the reasoning capabilities of large language models (LLMs), with recent innovations such as Group Relative Policy Optimization (GRPO) demonstrating exceptional effectiveness. In this study, we identify a critical yet underexplored issue in RL training: low-probability tokens disproportionately influence model updates due to their large gradient magnitudes. This dominance hinders the effective learning of high-probability tokens, whose gradients are essential for LLMs' performance but are substantially suppressed. To mitigate this interference, we propose two novel methods: Advantage Reweighting and Low-Probability Token Isolation (Lopti), both of which effectively attenuate gradients from low-probability tokens while emphasizing parameter updates driven by high-probability tokens. Our approaches promote balanced updates across tokens with varying probabilities, thereby enhancing the efficiency of RL training. Experimental results demonstrate that they substantially improve the performance of GRPO-trained LLMs, achieving up to a 46.2% improvement in K&K Logic Puzzle reasoning tasks. Our implementation is available at https://github.com/zhyang2226/AR-Lopti.
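The abstract's central observation, that low-probability tokens produce larger gradients, can be illustrated on a single softmax layer. The sketch below is not the authors' implementation (the linked repository is the reference for that). It relies only on the standard identity that the gradient of log pi(token) with respect to the logits equals onehot(token) - softmax(logits), so its norm grows as the token's probability shrinks. The helper `reweight_advantage`, its linear form, and the value `alpha=0.3` are hypothetical, shown only to suggest how an advantage could be scaled down for low-probability tokens; none of them are taken from this page.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logprob_grad_norm(logits, token):
    """L2 norm of d log pi(token) / d logits = onehot(token) - softmax(logits)."""
    probs = softmax(logits)
    grad = [(1.0 if i == token else 0.0) - p for i, p in enumerate(probs)]
    return math.sqrt(sum(g * g for g in grad))


def reweight_advantage(advantage, token_prob, alpha=0.3):
    """Hypothetical linear reweighting (an assumption, not the paper's stated formula):
    the scale factor rises from (1 - alpha) at probability 0 to 1 at probability 1."""
    return (alpha * token_prob + (1.0 - alpha)) * advantage


# Toy 4-token vocabulary: token 0 is high-probability, token 3 is low-probability.
logits = [4.0, 1.0, 0.0, -2.0]
probs = softmax(logits)

high = logprob_grad_norm(logits, 0)
low = logprob_grad_norm(logits, 3)

print(f"p(token 0) = {probs[0]:.4f}, gradient norm = {high:.4f}")
print(f"p(token 3) = {probs[3]:.4f}, gradient norm = {low:.4f}")
print(f"reweighted advantage, token 0: {reweight_advantage(1.0, probs[0]):.4f}")
print(f"reweighted advantage, token 3: {reweight_advantage(1.0, probs[3]):.4f}")
```

In this toy setting the low-probability token yields the larger logit gradient, while the hypothetical reweighting gives it the smaller effective advantage, which is the direction of correction the abstract describes.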

Zhihe Yang, Xufang Luo, Zilong Wang, Dongqi Han, Zhiyuan He, Dongsheng Li, Yunjian Xu • 2025

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	MATH 500	Accuracy (Acc) 77.71	600
Mathematical Reasoning	AIME 2024	Accuracy 8.9	394
Mathematical Reasoning	AIME 2025	Accuracy 8.8	378
Mathematical Reasoning	Olympiad Bench	Accuracy 37.95	254
Mathematical Reasoning	Minerva	Accuracy (Acc) 30.22	146
Mathematical Problem Solving	MATH500	Accuracy 73.75	96
Mathematical Problem Solving	AIME 25	Accuracy 3.96	84
Question Answering	NQ, TriviaQA, PopQA, HotpotQA, 2wiki, MuSiQue, Bamboogle	Average QA Score 36.11	55
Math problem solving	OlympiadBench	Accuracy 33.6	50
Mathematical Reasoning	Mathematical Reasoning Aggregate	Average Score 41.72	46

Showing 10 of 27 rows
