
Simulated Tasks

Benchmarks

Task Name            Dataset Name                              SOTA Result           Trend
High-level planning  Simulated Tasks All tasks                 Success Rate: 86.1    4
High-level planning  Simulated Tasks >7 actions (Long split)   Success Rate: 65.18   4
High-level planning  Simulated Tasks Medium 3–7 actions        Success Rate: 97.96   4
High-level planning  Simulated Tasks ≤2 actions (Short)        Success Rate: 97.19   4