DafnyPro: LLM-Assisted Automated Verification for Dafny Programs

About

We present DafnyPro, an inference-time framework that enhances LLMs for generating verification annotations in Dafny. DafnyPro comprises three key components: a diff-checker that prevents modifications to base program logic, a pruner that removes unnecessary invariants, and a hint-augmentation system that retrieves and applies predefined, problem-independent proof strategies. We evaluate DafnyPro using Claude Sonnet 3.5 and 3.7 on four benchmarks: Clover, MBPP-Dafny, HumanEval-Dafny, and DafnyBench, achieving consistent performance gains in all cases. Notably, on DafnyBench, the most challenging benchmark, Claude Sonnet 3.5 enhanced with DafnyPro achieves 86% correct proofs, a 16 pp improvement over the base model. We also fine-tune two Qwen models on training data derived from verification attempts by larger models enhanced with DafnyPro. Our 7B and 14B models achieve 68% and 70% correct proofs on DafnyBench, respectively, demonstrating that smaller models can maintain high verification accuracy.
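The diff-checker idea can be illustrated with a minimal sketch. The abstract does not describe how DafnyPro implements this check, so everything below is an assumption made for illustration: the function names `strip_annotations` and `logic_unchanged`, the choice of `invariant`, `decreases`, and `assert` as the proof-only keywords, and the line-based comparison (which would miss annotations that span several lines).

```python
import re

# Assumption: lines starting with these keywords are proof annotations only.
# Specification clauses such as `requires` and `ensures` are deliberately kept,
# so a change to the specification counts as a change to the program.
ANNOTATION_RE = re.compile(r"^(invariant|decreases|assert)\b")


def strip_annotations(source: str) -> list[str]:
    """Return the non-annotation lines of a Dafny program, whitespace-normalized."""
    kept = []
    for line in source.splitlines():
        normalized = " ".join(line.split())
        if not normalized or ANNOTATION_RE.match(normalized):
            continue
        kept.append(normalized)
    return kept


def logic_unchanged(base: str, candidate: str) -> bool:
    """True if the candidate differs from the base only by annotation lines."""
    return strip_annotations(base) == strip_annotations(candidate)


BASE = """
method Sum(n: nat) returns (s: nat)
  ensures s == n * (n + 1) / 2
{
  s := 0;
  var i := 0;
  while i < n
  {
    i := i + 1;
    s := s + i;
  }
}
"""

# Same program with loop invariants added: this should be accepted.
ANNOTATED = BASE.replace(
    "while i < n\n",
    "while i < n\n"
    "    invariant 0 <= i <= n\n"
    "    invariant s == i * (i + 1) / 2\n",
)

# Program logic altered (step size changed): this should be rejected.
TAMPERED = ANNOTATED.replace("i := i + 1;", "i := i + 2;")

if __name__ == "__main__":
    print(logic_unchanged(BASE, ANNOTATED))
    print(logic_unchanged(BASE, TAMPERED))
```

In this sketch, a candidate that only adds invariants passes the check, while one that edits an executable statement or an `ensures` clause fails it. A production checker would more plausibly compare parsed syntax trees than raw lines.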

Debangshu Banerjee, Olivier Bouissou, Stefan Zetzsche • 2026

Related benchmarks

Task	Dataset	Metric	Result	Rank
Dafny Program Verification	HumanEvalDafny (test)	Verification Success Rate (NoDiff)	86.9	4
Dafny Program Verification	DafnyBench (test)	Verification Rate (NoDiff)	81.6	4
