Hilbert: Recursively Building Formal Proofs with Informal Reasoning

About

Large Language Models (LLMs) demonstrate impressive mathematical reasoning abilities, but their solutions frequently contain errors that cannot be automatically checked. Formal theorem proving systems such as Lean 4 offer automated verification with complete accuracy, motivating recent efforts to build specialized prover LLMs that generate verifiable proofs in formal languages. However, a significant gap remains: current prover LLMs solve substantially fewer problems than general-purpose LLMs operating in natural language. We introduce Hilbert, an agentic framework that bridges this gap by combining the complementary strengths of informal reasoning and formal verification. Our system orchestrates four components: an informal LLM that excels at mathematical reasoning, a specialized prover LLM optimized for Lean 4 tactics, a formal verifier, and a semantic theorem retriever. Given a problem that the prover is unable to solve, Hilbert employs recursive decomposition to split the problem into subgoals that it solves with the prover or reasoner LLM. It leverages verifier feedback to refine incorrect proofs as necessary. Experimental results demonstrate that Hilbert substantially outperforms existing approaches on key benchmarks, achieving 99.2% on miniF2F, 6.6 percentage points above the best publicly available method. Hilbert achieves the strongest known result from a publicly available model on PutnamBench. It solves 462/660 problems (70.0%), outperforming proprietary approaches like SeedProver (50.4%) and achieving a 422% improvement over the best publicly available baseline. Thus, Hilbert effectively narrows the gap between informal reasoning and formal proof generation. Code is available at https://github.com/Rose-STL-Lab/ml-hilbert.
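The control flow described in the abstract (try the prover, refine using verifier feedback, otherwise decompose into subgoals and recurse) can be illustrated with a minimal Python sketch. This is not the Hilbert implementation: the `Components` container, the `prove` function, its `max_depth` and `attempts` parameters, and the toy string-based prover, verifier, retriever, and decomposer are all hypothetical stand-ins chosen only to make the recursion concrete.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Components:
    """Hypothetical stand-ins for the four components named in the abstract."""
    prover: Callable[[str, Optional[str]], str]      # (goal, feedback) -> candidate proof
    verifier: Callable[[str, str], Optional[str]]    # (goal, proof) -> error text, or None if accepted
    retriever: Callable[[str], List[str]]            # goal -> names of possibly relevant theorems
    decompose: Callable[[str, List[str]], List[str]] # reasoner: (goal, hints) -> subgoals
    assemble: Callable[[str, List[str]], str]        # (goal, subproofs) -> combined proof


def prove(goal: str, c: Components, depth: int = 0,
          max_depth: int = 3, attempts: int = 2) -> Optional[str]:
    """Return a verified proof of `goal`, or None if the sketch gives up."""
    # Step 1: try the prover directly, feeding verifier errors back in.
    feedback: Optional[str] = None
    for _ in range(attempts):
        proof = c.prover(goal, feedback)
        feedback = c.verifier(goal, proof)
        if feedback is None:
            return proof

    # Step 2: stop recursing once the (assumed) depth budget is spent.
    if depth >= max_depth:
        return None

    # Step 3: ask the reasoner for subgoals, using retrieved theorems as hints.
    subgoals = c.decompose(goal, c.retriever(goal))
    if not subgoals:
        return None

    # Step 4: solve every subgoal recursively; one failure fails the parent.
    subproofs: List[str] = []
    for sub in subgoals:
        sub_proof = prove(sub, c, depth + 1, max_depth, attempts)
        if sub_proof is None:
            return None
        subproofs.append(sub_proof)

    # Step 5: assemble the pieces and verify the combined proof.
    combined = c.assemble(goal, subproofs)
    return combined if c.verifier(goal, combined) is None else None


# Toy components: goals are strings, and only goals without " and " are "easy".
TOY = Components(
    prover=lambda goal, fb: f"atom({goal})" if " and " not in goal else "sorry",
    verifier=lambda goal, proof: "contains sorry" if "sorry" in proof else None,
    retriever=lambda goal: [],
    decompose=lambda goal, hints: goal.split(" and ", 1) if " and " in goal else [],
    assemble=lambda goal, subs: f"both({', '.join(subs)})",
)

if __name__ == "__main__":
    print(prove("p and q and r", TOY))
```

In the toy run, the prover fails on the conjunction, so the goal is split into `p` and `q and r`, each part is solved recursively, and the assembled proof is re-checked by the verifier before being returned. A real system would replace the lambdas with calls to LLMs and a Lean 4 checker.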

Sumanth Varambally, Thomas Voice, Yanchao Sun, Zhifeng Chen, Rose Yu, Ke Ye • 2025

Related benchmarks

Task	Dataset	Result
Formal Theorem Proving	MiniF2F (test)	Pass@1 99.2	128
Formal Theorem Proving	PutnamBench	Solved Count 462	42
Automated Formal Theorem Proving	Putnam 2025	Average Score 1.22e+3	28
Theorem Proving	PutnamBench Lean	Solved Rate 462	23
Formal Theorem Proving	Combibench	Solve Rate 2.05	15
Formal Theorem Proving	Inequality	567NEQ 3.1	13
Formal Theorem Proving	Number Theory	PutnamBench 2.51	13
Theorem Proving	PutnamBench (test)	Accuracy 72	13
Theorem Proving	567NEQ	Solved Problems 51	13
Theorem Proving	ChenNEQ	Solved Problems 31	13

Showing 10 of 19 rows
