
ToolSandbox

Benchmarks

Task Name | Dataset Name | SOTA Result | Trend
Agent Task Completion | ToolSandbox (test) | Avg Task Reward: 0.704 | 27
Tool Use Evaluation | ToolSandbox | Similarity: 0.923 | 19
Agentic Tool-Use | ToolSandbox Multi-Tool | TS-M Score: 53.7 | 16
Multi-turn agent decision making | ToolSandbox (test) | Success Rate: 52.2 | 7
Agent Task Completion | ToolSandbox | Average Task Reward: 0.67 | 2