What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time

About

Test-Time Reinforcement Learning (TTRL) enables Large Language Models (LLMs) to improve their reasoning on unlabeled test streams by deriving pseudo-rewards from majority-voting consensus. However, existing TTRL methods rely exclusively on positive pseudo-labeling. This reliance is fragile in challenging scenarios where answer distributions are highly dispersed: the resulting weak consensus can reinforce incorrect trajectories by treating them as supervision signals. In this paper, we propose SCRL (Selective-Complementary Reinforcement Learning), a robust test-time reinforcement learning framework that mitigates label noise amplification. SCRL develops Selective Positive Pseudo-Labeling, which enforces strict consensus criteria to filter out unreliable majorities. Complementarily, SCRL introduces Entropy-Gated Negative Pseudo-Labeling, the first negative supervision mechanism in TTRL, which prunes incorrect trajectories based on generation uncertainty. Extensive experiments on multiple reasoning benchmarks demonstrate that SCRL achieves substantial improvements over baselines while maintaining robust generalization and training stability under constrained rollout budgets. Our code is available at https://github.com/Jasper-Yan/SCRL.
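To make the contrast with plain majority voting concrete, here is a minimal Python sketch of the idea described in the abstract. It is an illustration under stated assumptions, not the authors' implementation: the function names, the thresholds (`min_share`, `max_entropy`, `rare_share`), and the specific gating rule are all hypothetical, since the abstract does not specify them. The sketch assumes that a positive label is issued only when the majority answer holds a large enough vote share, and that rare answers are marked negative only when the entropy of the answer distribution is low.

```python
import math
from collections import Counter


def majority_vote_rewards(answers):
    """Baseline: reward every rollout that matches the most common answer."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]


def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical answer distribution."""
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in Counter(answers).values())


def selective_complementary_labels(answers, min_share=0.5,
                                   max_entropy=1.0, rare_share=0.2):
    """Return one label per rollout: +1 positive, -1 negative, 0 no signal.

    All thresholds are hypothetical placeholders, not values from the paper.
    """
    n = len(answers)
    counts = Counter(answers)
    majority, top = counts.most_common(1)[0]
    labels = [0] * n

    # Selective positive labeling (assumed rule): trust the majority
    # only when its vote share clears a strict threshold.
    if top / n >= min_share:
        labels = [1 if a == majority else 0 for a in answers]

    # Entropy-gated negative labeling (assumed rule): when uncertainty is
    # low, mark rarely produced non-majority answers as negatives.
    if answer_entropy(answers) <= max_entropy:
        for i, a in enumerate(answers):
            if a != majority and counts[a] / n <= rare_share:
                labels[i] = -1
    return labels


confident = ["42"] * 6 + ["41", "7"]
dispersed = ["1", "2", "3", "4", "5", "6", "7", "1"]
print(majority_vote_rewards(dispersed))
print(selective_complementary_labels(confident))
print(selective_complementary_labels(dispersed))
```

On the dispersed example, plain majority voting still rewards the two rollouts that happen to agree, whereas the sketch withholds both positive and negative labels because the consensus is weak and the entropy is high.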

Dong Yan, Jian Liang, Yanbo Wang, Shuo Lu, Ran He, Tieniu Tan • 2026

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	Minerva	Pass@1 Accuracy 43.1	289
Mathematical Reasoning	MATH 500	pass@1 86.2	239
Mathematical Reasoning	AIME 25	Pass@1 Accuracy 26.9	190
Mathematical Reasoning	AMC	Pass@1 Accuracy 36.1	119
Mathematical Reasoning	MATH 500	Pass@1 Rate 39.7	113
Mathematical Reasoning	AMC	Pass@1 Accuracy 68.5	84
Mathematical Reasoning	AIME25	Pass@1 60.2	48
General Reasoning	GPQA	pass@1 26	38
Mathematical Reasoning	MATH 500	Pass@1 77.8	12
Scientific Reasoning	GPQA	pass@1 38.2	12

Showing 10 of 12 rows
