Verifier-Backed Hard Problem Generation for Mathematical Reasoning

About

Large Language Models (LLMs) demonstrate strong capabilities for solving scientific and mathematical problems, yet they struggle to produce valid, challenging, and novel problems, a capability that is essential for advancing LLM training and enabling autonomous scientific research. Existing problem generation approaches either depend on expensive human expert involvement or adopt naive self-play paradigms, which frequently yield invalid problems due to reward hacking. This work introduces VHG, a verifier-enhanced hard problem generation framework built upon three-party self-play. By adding an independent verifier to the conventional setter-solver pair, the design makes the setter's reward depend jointly on problem validity (evaluated by the verifier) and problem difficulty (assessed by the solver). We instantiate two verifier variants, a Hard symbolic verifier and a Soft LLM-based verifier, and evaluate them on indefinite integral tasks and general mathematical reasoning tasks. Experimental results show that VHG outperforms all baseline methods by a clear margin.
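The abstract does not give the exact reward formula, so the following is only a minimal sketch of how a setter reward could combine validity and difficulty. Everything here is an assumption made for illustration: the names `numeric_integral_check` and `setter_reward` are hypothetical, the validity gate is a simple multiplicative one, difficulty is approximated as the solver's failure rate, and the numeric derivative check is a stdlib-only stand-in for the paper's Hard symbolic verifier, not a reproduction of it.

```python
import math
import random
from typing import Callable, Sequence


def numeric_integral_check(
    integrand: Callable[[float], float],
    antiderivative: Callable[[float], float],
    trials: int = 20,
    lo: float = 0.5,
    hi: float = 2.0,
    h: float = 1e-5,
    tol: float = 1e-4,
    seed: int = 0,
) -> bool:
    """Hypothetical stand-in for a hard verifier on indefinite integrals.

    Checks that d/dx antiderivative(x) matches integrand(x) at random
    points, using a central finite difference. A real symbolic verifier
    would compare the expressions exactly instead of sampling.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.uniform(lo, hi)
        derivative = (antiderivative(x + h) - antiderivative(x - h)) / (2 * h)
        target = integrand(x)
        if abs(derivative - target) > tol * (1.0 + abs(target)):
            return False
    return True


def setter_reward(is_valid: bool, solver_correct: Sequence[bool]) -> float:
    """Hypothetical setter reward (assumed form, not taken from the paper).

    Invalid problems earn nothing, which removes the incentive to make a
    problem "hard" by making it ill-posed. Valid problems earn the
    fraction of solver attempts that failed.
    """
    if not is_valid or not solver_correct:
        return 0.0
    solve_rate = sum(solver_correct) / len(solver_correct)
    return 1.0 - solve_rate


# Example: the setter proposes the integral of x*cos(x) with reference
# answer x*sin(x) + cos(x); the solver gets 1 of 4 attempts right.
valid = numeric_integral_check(
    lambda x: x * math.cos(x),
    lambda x: x * math.sin(x) + math.cos(x),
)
reward = setter_reward(valid, [True, False, False, False])
```

Under these assumptions, a problem whose reference answer fails the check receives zero reward no matter how often the solver fails, which is the property the verifier is meant to enforce.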

Yuhang Lai, Jiazhan Feng, Yee Whye Teh, Ning Miao • 2026

Related benchmarks

Task	Dataset	Result
Mathematical Problem Solving	AIME 2025	Top-1 Accuracy (%): 11.46	46
Mathematical Problem Solving	AMC	Pass@1: 55.31	27
Mathematical Problem Solving	MATH	Pass@1: 78.99	16
Mathematical Problem Solving	GSM8K	Pass@1: 90.61	15
Indefinite integral	Competition Indefinite Integral	Pass@1: 45.4	7
Indefinite integral	Qualifier Indefinite Integral	Pass@1: 69.4	7
Indefinite integral	Integral Stress (test)	Pass@1: 64.7	7
Mathematical Problem Solving	Olympiad	Pass@1: 42.13	7
Mathematical Problem Solving	Minerva	Pass@1: 33.32	7
Mathematical Problem Solving	AIME 2026	Pass@1: 12.92	7

Showing 10 of 12 rows
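Most results in the table are reported as Pass@1. The page does not say how the metric was computed, so the sketch below only shows the commonly used unbiased pass@k estimator, under the assumption that n samples are drawn per problem and c of them are correct. The function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n samples contains at least one correct one.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# For k = 1 the estimator reduces to the plain success rate c / n.
example = pass_at_k(n=10, c=3, k=1)
```

A benchmark-level score would then be the mean of this value over all problems, usually reported as a percentage.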
