Agent Planning and Execution on TaskCraft
Best result: Agent KB, 0.7533 pass@1 (Feb 8, 2026)
Evaluation Results
| Method | Details | Date | pass@1 |
|---|---|---|---|
| Agent KB | Model Family=GPT-4.1,... | 2026.02 | 0.7533 |
| Agent KB | Model Family=GPT-4.1,... | 2026.02 | 0.7267 |
| TodoEvolve + Smolagents | Model Family=GPT-5-Min... | 2026.02 | 0.7267 |
| TodoEvolve + Smolagents | Model Family=DeepSeek... | 2026.02 | 0.7133 |
| Flash-Searcher | Model Family=GPT-5-min... | 2026.02 | 0.6967 |
| Flash-Searcher | Model Family=DeepSeek... | 2026.02 | 0.6933 |
| TodoEvolve + Smolagents | Model Family=Kimi K2,... | 2026.02 | 0.6933 |
| Cognitive Kernel-Pro | Model Family=Claude-3.... | 2026.02 | 0.66 |
| Smolagents | Model Family=GPT-5-min... | 2026.02 | 0.64 |
| Agent KB | Model Family=GPT-4.1,... | 2026.02 | 0.6167 |
| OWL Workforce | Model Family=GPT-4o+o3... | 2026.02 | 0.5833 |
| Flash-Searcher | Model Family=Kimi K2,... | 2026.02 | 0.58 |