Reliable Use of Lemmas via Eligibility Reasoning and Section-Aware Reinforcement Learning
About
Recent large language models (LLMs) perform strongly on mathematical benchmarks yet often misapply lemmas, importing conclusions without validating their assumptions. We formalize lemma judging as a structured prediction task: given a statement and a candidate lemma, the model must output a precondition check and a conclusion-utility check, from which a usefulness decision is derived. We present RULES, which encodes this specification as a two-section output and trains with reinforcement learning plus section-aware loss masking that assigns the penalty to the section responsible for an error. Training and evaluation draw on diverse natural-language and formal proof corpora; robustness is assessed with a held-out perturbation suite; and end-to-end evaluation spans competition-style, perturbation-aligned, and theorem-based problems across various LLMs. Results show consistent in-domain gains over both a vanilla model and a single-label RL baseline, larger improvements on applicability-breaking perturbations, and parity or modest gains on end-to-end tasks; ablations indicate that two-section outputs and section-aware reinforcement are both necessary for robustness.
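The two-section specification and the section-aware credit assignment can be illustrated with a minimal sketch. Everything here is an assumption for illustration (the class `LemmaJudgment`, the function `blame_sections`, and the section names are hypothetical); the abstract does not give RULES's actual output format or reward implementation.

```python
# Hypothetical sketch of the two-section lemma-judging output described above.
# The names and structure are assumptions, not the paper's actual specification.
from dataclasses import dataclass


@dataclass
class LemmaJudgment:
    preconditions_hold: bool  # Section 1: do the lemma's assumptions hold here?
    conclusion_useful: bool   # Section 2: would the conclusion advance the goal?

    @property
    def useful(self) -> bool:
        # The usefulness decision is derived, not predicted directly:
        # a lemma is judged useful only if both section checks pass.
        return self.preconditions_hold and self.conclusion_useful


def blame_sections(pred: LemmaJudgment, gold: LemmaJudgment) -> list:
    """Section-aware credit assignment (sketch): return the section(s) whose
    prediction disagrees with the gold label, so an RL loss mask can
    concentrate the penalty on the tokens of the responsible section."""
    blamed = []
    if pred.preconditions_hold != gold.preconditions_hold:
        blamed.append("precondition_check")
    if pred.conclusion_useful != gold.conclusion_useful:
        blamed.append("conclusion_utility_check")
    return blamed


# Example: the model wrongly claims the preconditions hold (an
# applicability-breaking error), so only that section is penalized.
pred = LemmaJudgment(preconditions_hold=True, conclusion_useful=False)
gold = LemmaJudgment(preconditions_hold=False, conclusion_useful=False)
print(pred.useful)                 # False
print(blame_sections(pred, gold))  # ['precondition_check']
```

The derived decision means a single-label baseline (predicting `useful` directly) cannot distinguish which check failed, which is the motivation the abstract gives for section-aware reinforcement.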
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Lemma Judging | NaturalProofs (test) | Exact Match Accuracy | 93.2 | 12 |
| Lemma Judging | IMO-Lemma (IL) (test) | Exact Match Accuracy | 93.6 | 12 |
| Lemma Judging | DeepTheorem sampled perturbation (test) | Exact Match Accuracy | 90.9 | 12 |
| Mathematical Problem Solving | Putnam-Axiom (PA) | Exact Match Accuracy | 39.5 | 12 |
| Mathematical Problem Solving | MATH-Perturb MP-hard | Exact Match Accuracy | 56.3 | 12 |
| Mathematical Problem Solving | TheoremQA TQ-Math | Exact Match Accuracy | 57.7 | 12 |
| Premise Selection | Isabelle (IS) Premise Selection (test) | Exact Match Accuracy | 86.3 | 12 |
| Lemma Judging | NLPS (test) | Exact Match Accuracy | 86.4 | 12 |
| Mathematical Problem Solving | CounterMATH (CM) | F1 Score | 61.3 | 12 |
| Mathematical Problem Solving | MATH-Perturb MP-simple | Exact Match Accuracy | 72.8 | 12 |