Reliable Use of Lemmas via Eligibility Reasoning and Section-Aware Reinforcement Learning
About
Recent large language models (LLMs) perform strongly on mathematical benchmarks yet often misapply lemmas, importing conclusions without validating their assumptions. We formalize lemma judging as a structured prediction task: given a statement and a candidate lemma, the model must output a precondition check and a conclusion-utility check, from which a usefulness decision is derived. We present RULES, which encodes this specification as a two-section output and trains with reinforcement learning plus section-aware loss masking that assigns the penalty to the section responsible for an error. Training and evaluation draw on diverse natural-language and formal proof corpora; robustness is assessed with a held-out perturbation suite; and end-to-end evaluation spans competition-style, perturbation-aligned, and theorem-based problems across various LLMs. Results show consistent in-domain gains over both a vanilla model and a single-label RL baseline, larger improvements on applicability-breaking perturbations, and parity or modest gains on end-to-end tasks; ablations indicate that two-section outputs and section-aware reinforcement are both necessary for robustness.
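The two-section specification and the section-aware credit assignment can be illustrated with a minimal sketch. Everything here is an assumption for illustration (the class `LemmaJudgment`, the function `blame_sections`, and the section names are hypothetical); the abstract does not give RULES's actual output format or reward implementation.

```python
# Hypothetical sketch of the two-section lemma-judging output described above.
# The names and structure are assumptions, not the paper's actual specification.
from dataclasses import dataclass


@dataclass
class LemmaJudgment:
    preconditions_hold: bool  # Section 1: do the lemma's assumptions hold here?
    conclusion_useful: bool   # Section 2: would the conclusion advance the goal?

    @property
    def useful(self) -> bool:
        # The usefulness decision is derived, not predicted directly:
        # a lemma is judged useful only if both section checks pass.
        return self.preconditions_hold and self.conclusion_useful


def blame_sections(pred: LemmaJudgment, gold: LemmaJudgment) -> list:
    """Section-aware credit assignment (sketch): return the section(s) whose
    prediction disagrees with the gold label, so an RL loss mask can
    concentrate the penalty on the tokens of the responsible section."""
    blamed = []
    if pred.preconditions_hold != gold.preconditions_hold:
        blamed.append("precondition_check")
    if pred.conclusion_useful != gold.conclusion_useful:
        blamed.append("conclusion_utility_check")
    return blamed


# Example: the model wrongly claims the preconditions hold (an
# applicability-breaking error), so only that section is penalized.
pred = LemmaJudgment(preconditions_hold=True, conclusion_useful=False)
gold = LemmaJudgment(preconditions_hold=False, conclusion_useful=False)
print(pred.useful)                 # False
print(blame_sections(pred, gold))  # ['precondition_check']
```

The derived decision means a single-label baseline (predicting `useful` directly) cannot distinguish which check failed, which is the motivation the abstract gives for section-aware reinforcement.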
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Lemma Judging | NaturalProofs (test) | Exact Match Accuracy | 93.2 | 12 |
| Lemma Judging | IMO-Lemma (IL) (test) | Exact Match Accuracy | 93.6 | 12 |
| Lemma Judging | DeepTheorem sampled perturbation (test) | Exact Match Accuracy | 90.9 | 12 |
| Mathematical Problem Solving | Putnam-Axiom (PA) | Exact Match Accuracy | 39.5 | 12 |
| Mathematical Problem Solving | MATH-Perturb MP-hard | Exact Match Accuracy | 56.3 | 12 |
| Mathematical Problem Solving | TheoremQA TQ-Math | Exact Match Accuracy | 57.7 | 12 |
| Premise Selection | Isabelle (IS) Premise Selection (test) | Exact Match Accuracy | 86.3 | 12 |
| Lemma Judging | NLPS (test) | Exact Match Accuracy | 86.4 | 12 |
| Mathematical Problem Solving | CounterMATH (CM) | F1 Score | 61.3 | 12 |
| Mathematical Problem Solving | MATH-Perturb MP-simple | Exact Match Accuracy | 72.8 | 12 |