AppWorld

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Agentic task solving | AppWorld | TGC 90 | 28 |
| Multi-turn tool-use | AppWorld | Avg@4 63.6 | 25 |
| Agentic Tool-use | AppWorld (Challenge) | TGC 83.7 | 20 |
| Agentic Tool-use | AppWorld Normal | Task Goal Completion (TGC) 89.3 | 20 |
| Tool-use agentic performance | AppWorld | Avg@4 64.88 | 19 |
| Task and Scenario Goal Completion | AppWorld normal (test) | Task Goal Completion 91.2 | 18 |
| Task goal completion | AppWorld (test challenge) | Goal Completion Score 32 | 16 |
| Interactive environment task execution | AppWorld normal (test) | Avg@8 Success 4,554 | 15 |
| Agent Task | AppWorld Challenge (test) | Task Goal Completion (TGC) 66 | 13 |
| Agentic Task Completion | AppWorld LeaderBoard | Greedy Success Rate 48.8 | 13 |
| Tool Shortlisting | AppWorld v1.0 (test) | R-precision (AZ) 0.71 | 9 |
| Agent Task | AppWorld Average | Average Score 59.5 | 7 |
| Agent Task | AppWorld Normal (test) | TGC 76.2 | 7 |
| Multimodal app-use reasoning | AppWorld | Cost 0.05 | 7 |
| Agent-based interactive task execution | AppWorld | Accuracy 64.9 | 5 |
| Agentic Task Solving | AppWorld (test-n) | TGC Average 81.15 | 4 |
| Task-goal completion | AppWorld Challenge Qwen-2.5-32B (test) | Average Task Completion Score 51 | 4 |
| Task-goal completion | AppWorld Normal Qwen-2.5-32B (test) | Average Task Completion Score 75 | 4 |
| Web Task Execution | AppWorld Normal (test) | Task Goal Success Rate 89.5 | 4 |
| App-based Agentic Task | AppWorld unseen tasks (test) | Pass@1 66.6 | 3 |
| Agent Task and Scenario Completion | AppWorld (dev) | Task Goal Completion 89.5 | 2 |
| Agent Task and Scenario Completion | AppWorld (train) | Task Goal Completion 91.1 | 2 |