Agent Task on AppWorld Challenge (test)
[Chart: Task Goal Completion (TGC) and Scenario Goal Completion (SGC) over time. Highlighted point: ReAct + ACE, TGC 66, Oct 6, 2025.]
Updated 19d ago
Evaluation Results
Method           Configuration              Date     Task Goal Completion (TGC)  Scenario Goal Completion (SGC)
---------------  -------------------------  -------  --------------------------  ------------------------------
ReAct + ACE      Base LLM=DeepSeek-V3.1...  2025.10  66                          48.9
ReAct + ACE      Base LLM=DeepSeek-V3.1...  2025.10  57.3                        39.6
ReAct + ACE      Base LLM=DeepSeek-V3.1...  2025.10  54.4                        35.2
ReAct + DC (CU)  Base LLM=DeepSeek-V3.1...  2025.10  52.3                        30.8
ReAct + ICL      Base LLM=DeepSeek-V3.1...  2025.10  46                          27.3
ReAct + GEPA     Base LLM=DeepSeek-V3.1...  2025.10  46                          30.2
ReAct + ACE      Base LLM=GPT-OSS-120B,...  2025.10  43.2                        20.1
ReAct            Base LLM=DeepSeek-V3.1...  2025.10  41.5                        21.6
ReAct + ACE      Base LLM=GPT-OSS-120B,...  2025.10  40.3                        20.9
ReAct + GEPA     Base LLM=GPT-OSS-120B,...  2025.10  40.1                        20.9
ReAct + ACE      Base LLM=GPT-OSS-120B,...  2025.10  39.6                        18.7
ReAct            Base LLM=GPT-OSS-120B,...  2025.10  34.5                        15.1
ReAct + DC (CU)  Base LLM=GPT-OSS-120B,...  2025.10  30.8                        18.2