
Less Noise, More Voice: Reinforcement Learning for Reasoning via Instruction Purification

About

Reinforcement Learning with Verifiable Rewards (RLVR) has advanced LLM reasoning, but remains constrained by inefficient exploration under limited rollout budgets, leading to low sampling success rates and unstable training on complex tasks. We find that many exploration failures arise not from problem difficulty, but from a small number of prompt tokens that introduce interference. Building on this insight, we propose the Less Noise Sampling Framework (LENS), which first purifies prompts by identifying and removing interference tokens, then transfers successful rollouts from the purification process to supervise policy optimization on the original noisy prompts, enabling the model to learn to ignore interference in real-world, noisy prompting settings. Experimental results show that LENS significantly outperforms GRPO, delivering higher performance and faster convergence, with a 3.88% average gain and over a 1.6× speedup. Our work highlights the critical role of pruning interference tokens in improving rollout efficiency, offering a new perspective for RLVR research.
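The abstract's two-stage idea can be sketched in code: purify a prompt by dropping interference tokens, harvest rollouts that pass the verifier on the purified prompt, then pair those successful rollouts with the original noisy prompt as supervision targets. This is a minimal toy sketch of that data-collection loop; all helpers (`purify`, `sample_rollout`, `verify`) are hypothetical stand-ins, not the authors' implementation.

```python
import random

def purify(prompt_tokens, interference):
    """Drop tokens flagged as interference (toy purification step)."""
    return [t for t in prompt_tokens if t not in interference]

def sample_rollout(prompt_tokens, rng):
    """Toy policy: reliably solves the task only when no interference
    token is present in the prompt."""
    if "NOISE" in prompt_tokens:
        return rng.choice(["42", "wrong"])
    return "42"

def verify(answer):
    """Verifiable reward: exact match against the known answer."""
    return answer == "42"

def lens_collect(prompt_tokens, interference, budget=8, seed=0):
    """Collect (noisy_prompt, successful_rollout) supervision pairs.

    Rollouts are sampled from the purified prompt, but each success is
    paired with the ORIGINAL noisy prompt, so later policy optimization
    teaches the model to ignore the interference tokens.
    """
    rng = random.Random(seed)
    clean = purify(prompt_tokens, interference)
    pairs = []
    for _ in range(budget):
        rollout = sample_rollout(clean, rng)
        if verify(rollout):
            pairs.append((tuple(prompt_tokens), rollout))
    return pairs

noisy = ["solve", "x+1=43", "NOISE"]
pairs = lens_collect(noisy, interference={"NOISE"})
```

In this toy setup, purification makes every rollout within the budget succeed, so the collected pairs all supervise the original noisy prompt; a real RLVR pipeline would feed these pairs into the policy-gradient update.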

Yiju Guo, Tianyi Hu, Zexu Sun, Yankai Lin • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH | Pass@1 | 86 | 112 |
| Mathematical Reasoning | Minerva | Pass@1 | 48.16 | 55 |
| Mathematical Reasoning | Olympiad | Pass@1 | 53.33 | 50 |
| Mathematical Reasoning | AMC23 | Avg@16 | 56.41 | 29 |
| Mathematical Reasoning | GAO | Pass@1 | 75.32 | 18 |
| Mathematical Reasoning | AIME 25 | Average@16 | 20.21 | 18 |
| Mathematical Reasoning | Aggregate (MATH, Minerva, Olympiad, GAO, AMC23, AIME24, AIME25) | Average Score | 53.36 | 18 |
| Math Reasoning | AIME25 | Average@16 | 1.06e+3 | 12 |
| Math Reasoning | AIME 24 | Average@16 | 14.58 | 12 |
| Math Reasoning | MATH | Pass@1 | 77.4 | 12 |

Showing 10 of 12 rows
