
Do Not Let Low-Probability Tokens Over-Dominate in RL for LLMs

About

Reinforcement learning (RL) has become a cornerstone for enhancing the reasoning capabilities of large language models (LLMs), with recent innovations such as Group Relative Policy Optimization (GRPO) demonstrating exceptional effectiveness. In this study, we identify a critical yet underexplored issue in RL training: low-probability tokens disproportionately influence model updates due to their large gradient magnitudes. This dominance hinders the effective learning of high-probability tokens, whose gradients are essential for LLMs' performance but are substantially suppressed. To mitigate this interference, we propose two novel methods: Advantage Reweighting and Low-Probability Token Isolation (Lopti), both of which effectively attenuate gradients from low-probability tokens while emphasizing parameter updates driven by high-probability tokens. Our approaches promote balanced updates across tokens with varying probabilities, thereby enhancing the efficiency of RL training. Experimental results demonstrate that they substantially improve the performance of GRPO-trained LLMs, achieving up to a 46.2% improvement in K&K Logic Puzzle reasoning tasks. Our implementation is available at https://github.com/zhyang2226/AR-Lopti.
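The gradient-magnitude claim can be illustrated with a toy softmax policy. The sketch below is not the authors' implementation. It uses the standard identity that the gradient of log π(token) with respect to the logits equals onehot(token) − π, and compares the norm of that gradient for a high-probability and a low-probability token. The `reweight_advantage` helper, its linear form, and the value `alpha=0.3` are assumptions made for illustration only; the authors' actual formulation is in their repository.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_grad_norm(logits, token):
    """L2 norm of d log pi(token) / d logits, which equals onehot(token) - pi."""
    probs = softmax(logits)
    grad = [(1.0 if j == token else 0.0) - p for j, p in enumerate(probs)]
    return math.sqrt(sum(g * g for g in grad))

def reweight_advantage(advantage, token_prob, alpha=0.3):
    """Hypothetical linear reweighting (an assumption, not the paper's code).

    The scale factor lies in [1 - alpha, 1] and shrinks as token_prob falls,
    so low-probability tokens contribute a smaller update.
    """
    return (alpha * token_prob + (1.0 - alpha)) * advantage

# Toy 4-token vocabulary: token 0 is likely, token 3 is unlikely.
logits = [4.0, 1.0, 0.0, -2.0]
probs = softmax(logits)
high_norm = logit_grad_norm(logits, 0)  # gradient norm for the likely token
low_norm = logit_grad_norm(logits, 3)   # gradient norm for the unlikely token

print(f"p(high)={probs[0]:.4f}, grad norm={high_norm:.4f}")
print(f"p(low)={probs[3]:.4f}, grad norm={low_norm:.4f}")
print(f"reweighted advantage (low-prob token): {reweight_advantage(1.0, probs[3]):.4f}")
```

This covers only the gradient with respect to the logits. The abstract's claim concerns updates to the model's parameters, so the sketch should be read as intuition rather than a reproduction of the paper's analysis.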

Zhihe Yang, Xufang Luo, Zilong Wang, Dongqi Han, Zhiyuan He, Dongsheng Li, Yunjian Xu • 2025

Related benchmarks

Task                    Dataset                                                   Metric               Result  Rank
Mathematical Reasoning  MATH 500                                                  Accuracy (Acc)       77.71   543
Mathematical Reasoning  Olympiad Bench                                            Accuracy             37.95   222
Mathematical Reasoning  Minerva                                                   Accuracy (Acc)       30.22   146
Mathematical Reasoning  Mathematical Reasoning Aggregate                          Average Score        41.72   37
Question Answering      NQ, TriviaQA, PopQA, HotpotQA, 2wiki, MuSiQue, Bamboogle  NQ Score             49.55   22
Logical Reasoning       K&K Logic Puzzles                                         Accuracy (Level 3)   96      12
Mathematical Reasoning  AIME                                                      Average Score (@16)  15.52   11
STEM Reasoning          Minerva                                                   Avg@32               47.01   8
