Multi-turn agent decision making on ToolSandbox (test)
[Chart: Success Rate over time. Best result: H-EPM, 52.2 (Dec 8, 2025).]
Evaluation Results

Method        Configuration              Date     Success Rate
H-EPM         Backbone=Qwen3-4B-Inst...  2025.12  52.2
H-EPM         Backbone=Qwen3-4B-Inst...  2025.12  51
H-EPM         Backbone=Qwen3-4B-Inst...  2025.12  50.5
GRPO          Backbone=Qwen3-4B-Inst...  2025.12  50.3
AgentEvolver  Backbone=Qwen3-4B-Inst...  2025.12  49
BASE          Backbone=Qwen3-4B-Inst...  2025.12  47.8
SFT           Backbone=Qwen3-4B-Inst...  2025.12  47.3