Share your thoughts, 1 month free Claude Pro on usSee more
WorkDL logo mark

Calendar

Benchmarks

Task NameDataset NameSOTA ResultTrend
Step-level correctness assessmentCalendar (test)
PR-AUC0.789
22
Step-level reasoning verificationCalendar
PR-AUC69.8
19
PlanningCalendar (test)
Accuracy48
6
Skill evolutionCalendar merge
Best@2077.7
1
Showing 4 of 4 rows