Share your thoughts, 1 month free Claude Pro on us
See more
Home
/
Datasets
Calendar
Loading...
Benchmarks
Task Name
Dataset Name
Task Name
Dataset Name
SOTA Result
Trend
Results
Step-level correctness assessment
Calendar (test)
PR-AUC
0.789
22
Step-level reasoning verification
Calendar
PR-AUC
69.8
19
Planning
Calendar (test)
Accuracy
48
6
Skill evolution
Calendar merge
Best@20
77.7
1
Showing 4 of 4 rows
25 / page
50 / page
100 / page
1
Feedback
Search any
task
Search any
task