
Next-state prediction on WebShop

79.05 EM Accuracy

Qwen2.5-7B

Updated 4mo ago

Evaluation Results

Method	Date	EM Accuracy
Qwen2.5-7B	2025.12	79.05
Llama3.1-8B	2025.12	77.24
Gemini-2.5-flash	2025.12	66.09
GPT-5	2025.12	65.9
GPT-4o	2025.12	64.62
GPT-4.1	2025.12	64.23
GPT-4-turbo	2025.12	62.76
GPT-4o-mini	2025.12	61.93
Claude-sonnet-4.5	2025.12	58.8
GPT-4o	2025.12	58.2
GPT-4.1	2025.12	58.07
Gemini-2.5-flash	2025.12	57.64
Claude-sonnet-4.5	2025.12	56.65
GPT-4o-mini	2025.12	56.59
GPT-4-turbo	2025.12	52.45
GPT-5	2025.12	46.12
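The scores above are exact-match (EM) accuracy. A minimal sketch of how such a score could be computed is shown below. It assumes EM is the percentage of test examples whose predicted next state equals the reference state exactly, after stripping surrounding whitespace. That normalization is an assumption, since the benchmark's own comparison rule is not stated here, and the function name `exact_match_accuracy` is hypothetical.

```python
def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Return the percentage of predictions that exactly match their reference.

    Hypothetical helper. Assumption: strings are compared after stripping
    leading/trailing whitespace; the benchmark's real normalization may differ.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one example")
    # Count examples where the predicted state equals the reference state.
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    # Scale to a percentage, matching the leaderboard's 0-100 range.
    return 100.0 * hits / len(references)


if __name__ == "__main__":
    # Hypothetical next-state strings, for illustration only.
    preds = ["page: search results", "page: item detail"]
    refs = ["page: search results", "page: cart"]
    print(exact_match_accuracy(preds, refs))
```

With one of the two hypothetical predictions matching, the example prints 50.0.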