ToolSandbox

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Agent Task Completion | ToolSandbox (test) | Avg Task Reward: 0.704 | 27 |
| Tool Use Evaluation | ToolSandbox | Similarity: 0.923 | 19 |
| Multi-turn agent decision making | ToolSandbox (test) | Success Rate: 52.2 | 7 |
| Agent Task Completion | ToolSandbox | Average Task Reward: 0.67 | 2 |