JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR

About

Reinforcement learning with verifiable rewards (RLVR) enhances the reasoning of large language models (LLMs), but standard RLVR often depends on human-annotated answers or carefully curated reward specifications. In machine-checkable domains, label-free alternatives such as majority voting or LLM-as-a-judge remove annotation cost but can introduce false positives that destabilize training. We introduce JURY-RL, a label-free RLVR framework that decouples answer proposal from reward disposal: votes from model rollouts propose a candidate answer, and a formal verifier determines whether that candidate can receive positive reward. Concretely, only rollouts matching the plurality-voted answer are rewarded when that answer is successfully verified in Lean. When verification is inconclusive, we invoke ResZero (Residual-Zero), a fallback reward that discards the unverified plurality proposal and redistributes a zero-mean, variance-preserving signal over the residual answers. This design maintains a stable optimization gradient without reinforcing unverifiable consensus. Across three backbone models trained on mathematical data, JURY-RL consistently outperforms other label-free baselines on mathematical reasoning benchmarks and transfers competitively to code generation and general benchmarks. It attains pass@1 performance comparable to supervised ground-truth training, with superior generalization demonstrated by higher pass@k and response diversity.
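The reward rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: `jury_rewards` and the `verify` callback are hypothetical names, `verify` stands in for the Lean check, and any result other than `True` is treated as inconclusive. The abstract does not give the ResZero formula, so the fallback below uses a simple centered vote share over the residual answers. It is zero-mean, but it makes no attempt to reproduce the variance-preserving scaling.

```python
from collections import Counter
from typing import Callable, Hashable, List, Optional


def jury_rewards(
    answers: List[Hashable],
    verify: Callable[[Hashable], Optional[bool]],
) -> List[float]:
    """Return one reward per rollout, given each rollout's final answer.

    `verify` is a hypothetical stand-in for the formal (Lean) check:
    it returns True when the proposed answer is verified, and anything
    else (False or None) is treated here as inconclusive.
    """
    # Votes propose: the plurality answer is the only candidate.
    # Ties go to the answer that appears first in `answers`.
    proposal, _ = Counter(answers).most_common(1)[0]

    # Proofs dispose: reward matching rollouts only if verified.
    if verify(proposal) is True:
        return [1.0 if a == proposal else 0.0 for a in answers]

    # Fallback (assumed form, not the paper's exact ResZero formula):
    # drop the unverified proposal and give each residual rollout its
    # answer's vote share among residuals, centered to zero mean.
    residual = [a for a in answers if a != proposal]
    if not residual:
        return [0.0] * len(answers)
    counts = Counter(residual)
    share = {a: c / len(residual) for a, c in counts.items()}
    mean_share = sum(share[a] for a in residual) / len(residual)
    return [0.0 if a == proposal else share[a] - mean_share for a in answers]


if __name__ == "__main__":
    votes = ["4", "4", "4", "5", "5", "6"]
    print(jury_rewards(votes, lambda a: True))   # verified proposal
    print(jury_rewards(votes, lambda a: None))   # inconclusive check
```

In the inconclusive case the plurality rollouts receive no reward, so an unverified consensus is never reinforced, while the residual rewards sum to zero across the group.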

Xinjie Chen, Biao Fu, Jing Wu, Guoxin Chen, Xinggao Liu, Dayiheng Liu, Minpeng Liao • 2026

Related benchmarks

Task                   Dataset    Result                        Rank
Instruction Following  IFEval     --                            836
Math Reasoning         GSM8K      Pass@4 Accuracy 95.83         54
Math Reasoning         MATH 500   Pass@4 75.65                  39
Multi-task Knowledge   MMLU-Pro   MMLU-Pro Score 0.5            33
Math Reasoning         AIME 24    Pass@16 36.67                 27
Math Reasoning         AMC        Pass@8 71.08                  27
Math Reasoning         AMC        Avg@8 Accuracy 47.29          27
Code                   CRUX       Accuracy@5 54.37              27
Math Reasoning         MATH 500   Success Rate (pass@4) 86.6    27
Math Reasoning         AIME 2024  Accuracy (avg@16) 13.75       27
Showing 10 of 14 rows
