
Less Noise, More Voice: Reinforcement Learning for Reasoning via Instruction Purification

About

Reinforcement Learning with Verifiable Rewards (RLVR) has advanced LLM reasoning, but remains constrained by inefficient exploration under limited rollout budgets, leading to low sampling success rates and unstable training on complex tasks. We find that many exploration failures arise not from problem difficulty, but from a small number of prompt tokens that introduce interference. Building on this insight, we propose the Less Noise Sampling Framework (LENS), which first purifies prompts by identifying and removing interference tokens, then transfers successful rollouts from the purification process to supervise policy optimization on the original noisy prompts, enabling the model to learn to ignore interference in real-world, noisy prompting settings. Experimental results show that LENS significantly outperforms GRPO, delivering higher performance and faster convergence, with a 3.88% average gain and over a 1.6× speedup. Our work highlights the critical role of pruning interference tokens in improving rollout efficiency, offering a new perspective for RLVR research.
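The abstract's two-stage idea can be sketched in code: purify a prompt by dropping interference tokens, harvest rollouts that pass the verifier on the purified prompt, then pair those successful rollouts with the original noisy prompt as supervision targets. This is a minimal toy sketch of that data-collection loop; all helpers (`purify`, `sample_rollout`, `verify`) are hypothetical stand-ins, not the authors' implementation.

```python
import random

def purify(prompt_tokens, interference):
    """Drop tokens flagged as interference (toy purification step)."""
    return [t for t in prompt_tokens if t not in interference]

def sample_rollout(prompt_tokens, rng):
    """Toy policy: reliably solves the task only when no interference
    token is present in the prompt."""
    if "NOISE" in prompt_tokens:
        return rng.choice(["42", "wrong"])
    return "42"

def verify(answer):
    """Verifiable reward: exact match against the known answer."""
    return answer == "42"

def lens_collect(prompt_tokens, interference, budget=8, seed=0):
    """Collect (noisy_prompt, successful_rollout) supervision pairs.

    Rollouts are sampled from the purified prompt, but each success is
    paired with the ORIGINAL noisy prompt, so later policy optimization
    teaches the model to ignore the interference tokens.
    """
    rng = random.Random(seed)
    clean = purify(prompt_tokens, interference)
    pairs = []
    for _ in range(budget):
        rollout = sample_rollout(clean, rng)
        if verify(rollout):
            pairs.append((tuple(prompt_tokens), rollout))
    return pairs

noisy = ["solve", "x+1=43", "NOISE"]
pairs = lens_collect(noisy, interference={"NOISE"})
```

In this toy setup, purification makes every rollout within the budget succeed, so the collected pairs all supervise the original noisy prompt; a real RLVR pipeline would feed these pairs into the policy-gradient update.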

Yiju Guo, Tianyi Hu, Zexu Sun, Yankai Lin • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH | Pass@1 | 86 | 112 |
| Mathematical Reasoning | Minerva | Pass@1 | 48.16 | 55 |
| Mathematical Reasoning | Olympiad | Pass@1 | 53.33 | 50 |
| Mathematical Reasoning | AMC23 | Avg@16 | 56.41 | 29 |
| Mathematical Reasoning | GAO | Pass@1 | 75.32 | 18 |
| Mathematical Reasoning | AIME 25 | Average@16 | 20.21 | 18 |
| Mathematical Reasoning | Aggregate (MATH, Minerva, Olympiad, GAO, AMC23, AIME24, AIME25) | Average Score | 53.36 | 18 |
| Math Reasoning | AIME25 | Average@16 | 1.06e+3 | 12 |
| Math Reasoning | AIME 24 | Average@16 | 14.58 | 12 |
| Math Reasoning | MATH | Pass@1 | 77.4 | 12 |

Showing 10 of 12 rows
