TheoremForge: Scaling up Formal Data Synthesis with Low-Budget Agentic Workflow
About
The high cost of agentic workflows in formal mathematics hinders large-scale data synthesis, exacerbating the scarcity of open-source corpora. To address this, we introduce **TheoremForge**, a cost-effective formal data synthesis pipeline that decomposes the formalization process into five sub-tasks: *statement formalization*, *proof generation*, *premise selection*, *proof correction*, and *proof sketching*. Through a *Decoupled Extraction Strategy*, the workflow recovers valid training signals from globally failed trajectories, turning otherwise-wasted computation into usable data. Experiments on a 2,000-problem benchmark demonstrate that TheoremForge achieves a Verified Rate of 12.6%, surpassing the 8.6% baseline, at an average cost of only **$0.481** per successful trajectory using Gemini-3-Flash. Crucially, our strategy increases the data yield for proof generation by **1.6×** compared to standard filtering. These results establish TheoremForge as a scalable framework for constructing a data flywheel to train future expert models. Our code is available [here](https://github.com/timechess/TheoremForge).
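The Decoupled Extraction Strategy above can be sketched as follows. This is a minimal illustrative sketch, not the repository's actual API: all class and function names (`SubtaskRecord`, `Trajectory`, `extract_training_data`) are hypothetical, and it assumes each sub-task output can be checked locally (e.g. whether a formalized statement type-checks) independently of whether the full proof verifies.

```python
from dataclasses import dataclass

# The five sub-tasks named in the pipeline description above.
SUBTASKS = [
    "statement_formalization",
    "proof_generation",
    "premise_selection",
    "proof_correction",
    "proof_sketching",
]

@dataclass
class SubtaskRecord:
    task: str       # one of SUBTASKS
    prompt: str     # model input for this sub-task
    output: str     # model output for this sub-task
    verified: bool  # passed its local check (e.g. Lean type-checks)

@dataclass
class Trajectory:
    records: list        # ordered list of SubtaskRecord
    proof_verified: bool # did the final proof pass the verifier?

def extract_training_data(trajectories):
    """Standard filtering keeps records only from fully verified
    trajectories; decoupled extraction additionally salvages
    locally verified sub-task records from failed ones."""
    standard, salvaged = [], []
    for traj in trajectories:
        if traj.proof_verified:
            standard.extend(traj.records)
        else:
            # Recover valid signals from a globally failed trajectory.
            salvaged.extend(r for r in traj.records if r.verified)
    return standard, salvaged
```

Under this sketch, the reported 1.6× yield gain for proof generation corresponds to `salvaged` contributing records that standard filtering would have discarded wholesale.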
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Theorem Proving | DeepMath | FR (Fetch Rate) | 94 | 8 |
| Theorem Proving | DeepTheorem | False Rate | 54 | 8 |
| Theorem Proving | Small-scale benchmark (Overall) | VR | 33 | 8 |
| Formal Theorem Proving | Large-scale benchmark, 2,000 problems (test) | FR | 0.813 | 2 |