
AppWorld

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Agentic task solving | AppWorld | TGC: 90 | 28 |
| Agentic Task | AppWorld C (test) | PR Score: 48.4 | 26 |
| Agentic Task | AppWorld N (test) | PR Score: 57.1 | 26 |
| Multi-turn tool-use | AppWorld | Avg@4: 63.6 | 25 |
| Agentic Task Completion | AppWorld (test-normal) | Accuracy: 56.5 | 22 |
| Agentic Tool-use | AppWorld (Challenge) | TGC: 83.7 | 20 |
| Agentic Tool-use | AppWorld Normal | Task Goal Completion (TGC): 89.3 | 20 |
| Agent Task | AppWorld Normal (test) | TGC: 76.2 | 20 |
| Tool-use agentic performance | AppWorld | Avg@4: 64.88 | 19 |
| Task and Scenario Goal Completion | AppWorld normal (test) | Task Goal Completion: 91.2 | 18 |
| Task goal completion | AppWorld (test challenge) | Goal Completion Score: 32 | 16 |
| Interactive environment task execution | AppWorld normal (test) | Avg@8 Success: 4,554 | 15 |
| Agentic Task Completion | AppWorld Challenge (test) | Task Goal Completion (TGC): 49.88 | 13 |
| Agent Task | AppWorld Challenge (test) | Task Goal Completion (TGC): 66 | 13 |
| Agentic Task Completion | AppWorld LeaderBoard | Greedy Success Rate: 48.8 | 13 |
| Scenario-level policy synthesis | AppWorld normal (test) | Task Goal Completion (TGC): 98.2 | 12 |
| Agentic Task Completion | AppWorld normal Hard (test) | Accuracy: 39.7 | 11 |
| Agentic Task Completion | AppWorld Easy normal (test) | Accuracy: 86 | 11 |
| Interactive coding-centric agent tasks | AppWorld | Success Rate: 22.6 | 10 |
| App-based Task Execution | AppWorld-Challenge | Task Goal Completion (TGC): 52.8 | 10 |
| App-based Task Execution | AppWorld Normal | Task Goal Completion (TGC): 71.4 | 10 |
| Tool Shortlisting | AppWorld v1.0 (test) | R-precision (AZ): 0.71 | 9 |
| Task Goal Completion | AppWorld | Average Completion Score @4: 18.46 | 7 |
| Agent task completion | AppWorld | TGC Success Rate (N): 83.7 | 7 |
| Scenario-level policy synthesis | AppWorld challenge (test) | TGC: 98.3 | 7 |
Showing 25 of 36 rows