Do Not Let Low-Probability Tokens Over-Dominate in RL for LLMs
About
Reinforcement learning (RL) has become a cornerstone for enhancing the reasoning capabilities of large language models (LLMs), with recent innovations such as Group Relative Policy Optimization (GRPO) demonstrating exceptional effectiveness. In this study, we identify a critical yet underexplored issue in RL training: low-probability tokens disproportionately influence model updates due to their large gradient magnitudes. This dominance hinders the effective learning of high-probability tokens, whose gradients are essential for LLMs' performance but are substantially suppressed. To mitigate this interference, we propose two novel methods: Advantage Reweighting and Low-Probability Token Isolation (Lopti), both of which effectively attenuate gradients from low-probability tokens while emphasizing parameter updates driven by high-probability tokens. Our approaches promote balanced updates across tokens with varying probabilities, thereby enhancing the efficiency of RL training. Experimental results demonstrate that they substantially improve the performance of GRPO-trained LLMs, achieving up to a 46.2% improvement in K&K Logic Puzzle reasoning tasks. Our implementation is available at https://github.com/zhyang2226/AR-Lopti.
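The gradient imbalance described above can be illustrated with a toy softmax policy. The sketch below is not taken from the authors' repository. It relies only on the standard identity that the gradient of log p(token) with respect to the logits is onehot(token) - p, so a low-probability token yields a larger gradient norm. The `reweight_advantage` helper and its `alpha` value are hypothetical, shown only as one plausible way an advantage could be scaled down for low-probability tokens; the abstract does not specify the exact form used.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logprob_grad_norm(logits, token):
    """L2 norm of d log p(token) / d logits, which equals onehot(token) - p."""
    probs = softmax(logits)
    grad = [(1.0 if i == token else 0.0) - p for i, p in enumerate(probs)]
    return math.sqrt(sum(g * g for g in grad))


def reweight_advantage(advantage, prob, alpha=0.3):
    """Hypothetical linear reweighting (an assumption, not the paper's code):
    scale the advantage by alpha * prob + (1 - alpha), so tokens with lower
    probability receive a smaller effective advantage."""
    return (alpha * prob + (1.0 - alpha)) * advantage


# Toy vocabulary of four tokens: index 0 is likely, index 3 is unlikely.
logits = [4.0, 1.0, 0.0, -2.0]
probs = softmax(logits)
high, low = 0, 3

norm_high = logprob_grad_norm(logits, high)
norm_low = logprob_grad_norm(logits, low)

print(f"p(high)={probs[high]:.4f}  grad norm={norm_high:.4f}")
print(f"p(low)={probs[low]:.4f}  grad norm={norm_low:.4f}")
print(f"reweighted advantage (high): {reweight_advantage(1.0, probs[high]):.4f}")
print(f"reweighted advantage (low):  {reweight_advantage(1.0, probs[low]):.4f}")
```

In this toy setting the unlikely token produces the larger gradient norm, and the illustrative reweighting shrinks its advantage relative to the likely token. A real LLM update involves gradients with respect to all model parameters rather than the logits alone, so this shows only the direction of the effect.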
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH 500 | Accuracy (Acc) | 77.71 | 543 |
| Mathematical Reasoning | Olympiad Bench | Accuracy | 37.95 | 222 |
| Mathematical Reasoning | Minerva | Accuracy (Acc) | 30.22 | 146 |
| Mathematical Reasoning | Mathematical Reasoning Aggregate | Average Score | 41.72 | 37 |
| Question Answering | NQ, TriviaQA, PopQA, HotpotQA, 2wiki, MuSiQue, Bamboogle | NQ Score | 49.55 | 22 |
| Logical Reasoning | K&K Logic Puzzles | Accuracy (Level 3) | 96 | 12 |
| Mathematical Reasoning | AIME | Average Score (@16) | 15.52 | 11 |
| STEM Reasoning | Minerva | Avg@32 | 47.01 | 8 |