FormalEvolve: Neuro-Symbolic Evolutionary Search for Diverse Autoformalization

About

Autoformalization aims to produce formal statements that compile and faithfully preserve the intended meaning of informal mathematics. Yet standard single-output evaluation protocols collapse a many-to-many problem into a single-output prediction task. For downstream proving, this granularity is too coarse: a formal statement is not merely a faithful translation endpoint, but also a prover-facing interface whose structure can alter proof search under a fixed budget. We therefore recast autoformalization as budgeted test-time search: FormalEvolve maintains a compilation-feasible archive for reuse, while reporting the deduplicated semantically accepted repertoire for evaluation and downstream proving. It expands the archive with LLM-driven mutation, crossover, bounded patch repair, and symbolic Abstract Syntax Tree (AST) rewrites for structural diversity. Under a generator-call budget of T=100 with a fixed LLM semantic judge, FormalEvolve reaches SH@100 of 58.0% on CombiBench and 84.9% on ProofNet, improving over all no-archive controls while reducing the cross-problem concentration of semantic successes. To assess downstream value, we evaluate the resulting repertoires under a fixed B=64 prover budget, where they improve theorem-complete proving over the matched no-archive control; additional stronger-base statement-generation experiments show that archive-search gains hold with stronger seed and repair models. Manual faithfulness audits calibrate these judge-positive outputs.

Haijian Lu, Wei Wang, Jing Liu• 2026

Related benchmarks

Task	Dataset	Result
Statement generation	ProofNet N=186 (test)	CH@10098.4	11
Statement generation	CombiBench (N=100)	CH@1001	11
Autoformalization and Proving	CombiBench (N=100)	Pass@6444	4
Autoformalization and Proving	ProofNet N=186 (test)	Pass@640.6828	4

Showing 4 of 4 rows

Other info

Follow for update

@wizwand_team Discord