Robust Asynchronous Planning via Auto-Formalization

About

LLMs can plan either by generating action sequences directly (as a Planner) or by translating tasks into a domain-specific language for an external solver (as a Formalizer). Although most real-world tasks are asynchronous, with non-uniform durations, concurrency, and execution-time constraints, existing benchmarks hardly cover them. We unify these asynchronous planning challenges under a single formulation and introduce the first three benchmarks that address each at scale. We conclude that the choice of formal representation primarily determines whether planning scales: as dependency graphs grow from 5 to 100 actions, the Planner collapses from 96% to 5% plan accuracy and the PDDL2.1 Formalizer from 13% to 0%, while the CP-SAT Formalizer averages 94% and still achieves 83% at 100 actions. Faithfulness diagnostics show that PDDL2.1's predicate-based planning representation becomes brittle, relative to general constraint satisfaction programs, when LLMs must keep predicates, effects, and goals consistent. Execution-time updates of planning constraints degrade performance sharply (Planner 23.9%, PDDL2.1 0.7%, CP-SAT 46.1%), but a state-aware repair strategy that updates only event-induced constraints recovers the CP-SAT Formalizer to 84.5%.
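To make the "dependency graph" and "makespan" terms above concrete, here is a minimal sketch of how the makespan of an asynchronous plan could be computed. It is not the paper's method or benchmark code: it assumes unlimited parallelism (any action may start as soon as all of its prerequisites finish), and the function name, the action names, and the durations are hypothetical.

```python
from graphlib import TopologicalSorter


def makespan(durations, deps):
    """Earliest time by which every action can finish.

    durations: action -> duration (assumed non-negative).
    deps: action -> set of actions that must finish before it starts.
    Assumes unlimited parallelism; raises graphlib.CycleError on cyclic deps.
    """
    graph = {action: deps.get(action, ()) for action in durations}
    finish = {}
    # static_order() yields each action only after all of its prerequisites.
    for action in TopologicalSorter(graph).static_order():
        start = max((finish[p] for p in deps.get(action, ())), default=0)
        finish[action] = start + durations[action]
    return max(finish.values(), default=0)


# Hypothetical toy task: boiling and chopping can overlap; cooking waits for both.
durations = {"boil_water": 5, "chop_vegetables": 3, "cook_soup": 10}
deps = {"cook_soup": {"boil_water", "chop_vegetables"}}
print(makespan(durations, deps))  # 15
```

A sequential plan for the same toy task would take 18 time units, so the gap between 18 and 15 is the kind of saving an asynchronous planner is expected to find.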

Jiayi Zhang, Jianing Yin, Ben Zhou, Li Zhang • 2026

Related benchmarks

Task	Dataset	Result
Makespan Accuracy	AsyncPlan-XXL	Accuracy (S5): 98	24
Plan Generation	Robo Challenge (Online)	Plan Accuracy: 85.7	16
Asynchronous planning	AsyncHow	Makespan Accuracy: 98.44	15
Makespan Accuracy	Robotouille	Makespan Accuracy: 20	12
Plan Generation	Robo Challenge (Offline)	Plan Accuracy: 100	12
Asynchronous planning	Robotouille	Makespan Accuracy: 17.5	3
