Agent Behavior Adaptation on WebShop (WS) (test)
[Chart: Loop Ratio. Data point shown: Qwen3-4B-Instruct, 36.7, Feb 2, 2026.]
Updated 4d ago
Evaluation Results
Method                    Model Type      Date     Loop Ratio
Qwen3-4B-Instruct         Non-thinkin...  2026.02  36.7
Phi-4-reasoning           Thinking Model  2026.02  35.7
Mistral-7B-Instruct       Non-thinkin...  2026.02  17.5
gpt-oss-120b              Thinking Model  2026.02  9.2
Qwen3-30B-A3B-Instruct    Non-thinkin...  2026.02  5.7
Qwen3-4B-Thinking         Thinking Model  2026.02  4.5
Ministral-3-14B-Instruct  Non-thinkin...  2026.02  4
Phi-4                     Non-thinkin...  2026.02  2.6
Llama-3.1-8B-Instruct     Non-thinkin...  2026.02  2.1
Llama-3.3-70B-Instruct    Non-thinkin...  2026.02  1.6
Glm-4-9B-Chat             Non-thinkin...  2026.02  1.5
Qwen3-30B-A3B-Thinking    Thinking Model  2026.02  0.7
Gemini 2.5 Flash          Non-thinkin...  2026.02  0.2
Glm-4-32B-0414            Non-thinkin...  2026.02  0.1
DeepSeek-R1               Thinking Model  2026.02  0.1
DeepSeek-V3.2             Non-thinkin...  2026.02  0
Gemini 2.5 Pro            Thinking Model  2026.02  0