# Intent-aligned Formal Specification Synthesis via Traceable Refinement

## About
Large language models are increasingly used to generate code from natural language, but ensuring correctness remains challenging. Formal verification offers a principled way to obtain correctness guarantees by proving that a program satisfies a formal specification. However, specifications are frequently missing from real-world codebases, and writing high-quality specifications remains expensive and expertise-intensive. We present VeriSpecGen, a traceable refinement framework that synthesizes intent-aligned specifications in Lean through requirement-level attribution and localized repair. VeriSpecGen decomposes natural language into atomic requirements and generates requirement-targeted tests with explicit traceability maps to validate generated specifications. When validation fails, the traceability maps attribute failures to specific requirements, enabling targeted clause-level repairs. VeriSpecGen achieves 86.6% on the VERINA SpecGen task with Claude Opus 4.5, improving over baselines by up to 31.8 points across different model families and scales. Beyond inference-time gains, we generate 343K training examples from VeriSpecGen refinement trajectories and show that training on these trajectories improves specification synthesis by 62-106% in relative terms and transfers gains to general reasoning tasks.
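The refinement loop described above can be sketched in a few lines. This is a minimal illustrative sketch, not VeriSpecGen's actual API: all names (`validate`, `refine`, the toy clauses, the `repair` stand-in for an LLM repair call) are assumptions, and the "tests" here are trivial string checks standing in for requirement-targeted test execution.

```python
def validate(spec, tests, trace):
    """Run requirement-targeted tests against a candidate spec.

    `trace` maps each test id to the atomic requirement it validates,
    so every failing test is attributed to a specific requirement."""
    return {trace[tid] for tid, check in tests.items() if not check(spec)}


def refine(spec, tests, trace, req_to_clause, repair, max_rounds=5):
    """Localized repair: only clauses traced to failing requirements change."""
    for _ in range(max_rounds):
        failing = validate(spec, tests, trace)
        if not failing:
            break
        for rid in failing:
            cid = req_to_clause[rid]          # traceability: requirement -> clause
            spec[cid] = repair(cid, spec[cid])  # clause-level fix, rest untouched
    return spec


# Toy example: a spec for "return the maximum of a non-empty list",
# decomposed into two atomic requirements (R1: membership, R2: upper bound).
spec = {
    "membership": "result ∈ xs",
    "upper_bound": "∀ x ∈ xs, x ≥ result",  # bug: inequality is reversed
}
tests = {
    "t_member": lambda s: "result ∈ xs" in s["membership"],
    "t_bound": lambda s: "result ≥ x" in s["upper_bound"],
}
trace = {"t_member": "R1", "t_bound": "R2"}
req_to_clause = {"R1": "membership", "R2": "upper_bound"}


def repair(cid, clause):
    # Stand-in for an LLM repair call targeted at a single clause.
    return "∀ x ∈ xs, result ≥ x" if cid == "upper_bound" else clause


final = refine(spec, tests, trace, req_to_clause, repair)
print(final["upper_bound"])  # → ∀ x ∈ xs, result ≥ x
```

The key design point the sketch illustrates is that the traceability map turns a global "validation failed" signal into a per-requirement one, so repair edits one clause instead of regenerating the whole specification.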
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Code Generation | HumanEval 2021 (test) | Accuracy | 76.1 | 21 |
| Math Reasoning | AIME 2024 | Pass@1 | 63.3 | 18 |
| Code Generation | MBPP 2021 (test) | -- | -- | 18 |
| Specification Synthesis | VERINA | Pass@1 | 86.6 | 10 |
| Mathematical Reasoning | GSM8k 2021 (test) | Accuracy | 80.7 | 5 |
| Code Generation | VERINA CodeGen (test) | Pass@1 | 35.4 | 3 |
| Formal Specification Synthesis | VERINA SpecGen (test) | Pass@1 | 30.3 | 3 |
| Math Reasoning | AIME 2025 (test) | Pass@1 Rate | 50 | 3 |