
Agent Behavior Adaptation on WebShop (WS) (test)

Loop Ratio: 36.7 (Qwen3-4B-Instruct)

Updated 1mo ago

Evaluation Results

Method	Date	Loop Ratio
Qwen3-4B-Instruct	2026.02	36.7
Phi-4-reasoning	2026.02	35.7
Mistral-7B-Instruct	2026.02	17.5
gpt-oss-120b	2026.02	9.2
Qwen3-30B-A3B-Instruct	2026.02	5.7
Qwen3-4B-Thinking	2026.02	4.5
Ministral-3-14B-Instruct	2026.02	4.0
Phi-4	2026.02	2.6
Llama-3.1-8B-Instruct	2026.02	2.1
Llama-3.3-70B-Instruct	2026.02	1.6
Glm-4-9B-Chat	2026.02	1.5
Qwen3-30B-A3B-Thinking	2026.02	0.7
Gemini 2.5 Flash	2026.02	0.2
Glm-4-32B-0414	2026.02	0.1
DeepSeek-R1	2026.02	0.1
DeepSeek-V3.2	2026.02	0.0
Gemini 2.5 Pro	2026.02	0.0