TheoremForge: Scaling up Formal Data Synthesis with Low-Budget Agentic Workflow
About
The high cost of agentic workflows in formal mathematics hinders large-scale data synthesis, exacerbating the scarcity of open-source corpora. To address this, we introduce **TheoremForge**, a cost-effective formal data synthesis pipeline that decomposes the formalization process into five sub-tasks: *statement formalization*, *proof generation*, *premise selection*, *proof correction*, and *proof sketching*. Through a *Decoupled Extraction Strategy*, the workflow recovers valid training signals from globally failed trajectories, turning otherwise-wasted computation into usable data. Experiments on a 2,000-problem benchmark demonstrate that TheoremForge achieves a Verified Rate of 12.6%, surpassing the 8.6% baseline, at an average cost of only **$0.481** per successful trajectory using Gemini-3-Flash. Crucially, our strategy increases the data yield for proof generation by **1.6×** compared to standard filtering. These results establish TheoremForge as a scalable framework for constructing a data flywheel to train future expert models. Our code is available [here](https://github.com/timechess/TheoremForge).
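The Decoupled Extraction Strategy above can be sketched as follows. This is a minimal illustrative sketch, not the repository's actual API: all class and function names (`SubtaskRecord`, `Trajectory`, `extract_training_data`) are hypothetical, and it assumes each sub-task output can be checked locally (e.g. whether a formalized statement type-checks) independently of whether the full proof verifies.

```python
from dataclasses import dataclass

# The five sub-tasks named in the pipeline description above.
SUBTASKS = [
    "statement_formalization",
    "proof_generation",
    "premise_selection",
    "proof_correction",
    "proof_sketching",
]

@dataclass
class SubtaskRecord:
    task: str       # one of SUBTASKS
    prompt: str     # model input for this sub-task
    output: str     # model output for this sub-task
    verified: bool  # passed its local check (e.g. Lean type-checks)

@dataclass
class Trajectory:
    records: list        # ordered list of SubtaskRecord
    proof_verified: bool # did the final proof pass the verifier?

def extract_training_data(trajectories):
    """Standard filtering keeps records only from fully verified
    trajectories; decoupled extraction additionally salvages
    locally verified sub-task records from failed ones."""
    standard, salvaged = [], []
    for traj in trajectories:
        if traj.proof_verified:
            standard.extend(traj.records)
        else:
            # Recover valid signals from a globally failed trajectory.
            salvaged.extend(r for r in traj.records if r.verified)
    return standard, salvaged
```

Under this sketch, the reported 1.6× yield gain for proof generation corresponds to `salvaged` contributing records that standard filtering would have discarded wholesale.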
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Theorem Proving | DeepMath | FR (Fetch Rate) | 94 | 8 |
| Theorem Proving | DeepTheorem | False Rate | 54 | 8 |
| Theorem Proving | Small-scale benchmark (Overall) | VR | 33 | 8 |
| Formal Theorem Proving | Large-scale benchmark, 2,000 problems (test) | FR | 0.813 | 2 |