Puzzle Reasoning on FrozenLake
[Figure: Success Rate chart. Highlighted entry: o4-mini, 82 (May 5, 2026).]
Evaluation Results
| Method | Details | Date | Success Rate |
|---|---|---|---|
| o4-mini | | 2026.05 | 82 |
| Claude 4.5 Sonnet | | 2026.05 | 80 |
| GLANCE-Full | Backbone=Qwen2.5-VL-3B... | 2026.05 | 78 |
| Gemini 2.5 Pro | | 2026.05 | 78 |
| GLANCE-Base | Backbone=Qwen2.5-VL-3B... | 2026.05 | 73 |
| VAGEN-Full | Backbone=Qwen2.5-VL-3B... | 2026.05 | 72 |
| VAGEN-Base | Backbone=Qwen2.5-VL-3B... | 2026.05 | 71 |
| GLANCE w/ Turn-PPO | Backbone=Qwen2.5-VL-3B... | 2026.05 | 70 |
| Claude 3.7 Sonnet | | 2026.05 | 69 |
| Turn-PPO w/ Mask | Backbone=Qwen2.5-VL-3B... | 2026.05 | 68 |
| GRPO w/ Mask | Backbone=Qwen2.5-VL-3B... | 2026.05 | 57 |
| GPT-4o | | 2026.05 | 54 |
| Qwen2.5-VL-72B | Backbone=Qwen2.5-VL-72B | 2026.05 | 44 |
| Vanilla-PPO | Backbone=Qwen2.5-VL-3B... | 2026.05 | 21 |
| VLM-R1-3B | Backbone=VLM-R1-3B | 2026.05 | 15 |
| Qwen2.5-VL-7B | Backbone=Qwen2.5-VL-7B | 2026.05 | 14 |
| Qwen2.5-VL-3B | Backbone=Qwen2.5-VL-3B | 2026.05 | 14 |