
TheoremForge: Scaling up Formal Data Synthesis with Low-Budget Agentic Workflow

About

The high cost of agentic workflows in formal mathematics hinders large-scale data synthesis, exacerbating the scarcity of open-source corpora. To address this, we introduce TheoremForge, a cost-effective formal data synthesis pipeline that decomposes the formalization process into five sub-tasks: statement formalization, proof generation, premise selection, proof correction, and proof sketching. By implementing a Decoupled Extraction Strategy, the workflow recovers valid training signals from globally failed trajectories, effectively reusing computation that would otherwise be wasted. Experiments on a 2,000-problem benchmark demonstrate that TheoremForge achieves a Verified Rate of 12.6%, surpassing the 8.6% baseline, at an average cost of only $0.481 per successful trajectory using Gemini-3-Flash. Crucially, our strategy increases data yield by 1.6× for proof generation compared to standard filtering. These results establish TheoremForge as a scalable framework for constructing a data flywheel to train future expert models. Our code is available at https://github.com/timechess/TheoremForge.
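The core idea behind the Decoupled Extraction Strategy is that a trajectory can fail global verification while individual sub-task steps still pass their own local checks, and those steps remain usable as training data. A minimal sketch of that idea follows; all function names, record fields, and the data layout here are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of a "decoupled extraction" pass: harvest any
# sub-task step whose local check passed, regardless of whether the
# full trajectory ultimately verified. Field names are assumptions.

def extract_training_signals(trajectories):
    """Collect per-step training records from successful AND failed runs."""
    samples = []
    for traj in trajectories:
        for step in traj["steps"]:
            # Standard filtering would require traj["verified"] == True;
            # decoupled extraction keeps locally valid steps either way.
            if step["locally_valid"]:
                samples.append({
                    "task": step["task"],      # e.g. "proof_generation"
                    "input": step["input"],
                    "output": step["output"],
                })
    return samples

# Toy example: a globally failed trajectory still yields one sample.
trajs = [
    {
        "verified": False,
        "steps": [
            {"task": "statement_formalization", "locally_valid": True,
             "input": "informal statement", "output": "formal statement"},
            {"task": "proof_generation", "locally_valid": False,
             "input": "formal statement", "output": "broken proof"},
        ],
    },
]
print(len(extract_training_signals(trajs)))  # 1
```

Under strict filtering this toy trajectory would contribute nothing; the decoupled pass recovers one record, which is the mechanism behind the reported 1.6× yield increase.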

Yicheng Tao, Hongteng Xu • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Theorem Proving | DeepMath | FR (Fetch Rate) | 94 | 8 |
| Theorem Proving | DeepTheorem | False Rate | 54 | 8 |
| Theorem Proving | Small-scale benchmark (Overall) | VR | 33 | 8 |
| Formal Theorem Proving | Large-scale benchmark, 2,000 problems (test) | FR | 0.813 | 2 |
