Puzzle Reasoning on Reasoning Gym (test)
Best result: AIPO, 17.8 Avg@4
[Chart: Avg@4 by date; data point at May 8, 2026]
Updated 22d ago
Evaluation Results
Method     Policy Model    Date      Avg@4
AIPO       Qwen2.5-7...    2026.05   17.8
AIPO       Qwen2.5-7...    2026.05   16
LUFFY      Qwen2.5-7...    2026.05   15.7
Dr.GRPO    Qwen2.5-7...    2026.05   15
OPSD       Qwen2.5-7...    2026.05   14.7
GRPO       Qwen2.5-7...    2026.05   14.5
LUFFY      Qwen2.5-7...    2026.05   13.9
AIPO       Llama3.2-...    2026.05   13.3
OPSD       Qwen2.5-7...    2026.05   12.4
PRIME      Qwen2.5-7...    2026.05   12.2
LUFFY      Llama3.2-...    2026.05   11.9
AIPO       Llama3.2-...    2026.05   11
SFT        Qwen2.5-7...    2026.05   10.5
OPSD       Llama3.2-...    2026.05   10.4
SFT        Qwen2.5-7...    2026.05   9.8
Original   Qwen2.5-7...    2026.05   9.6
SFT        Llama3.2-...    2026.05   6
LUFFY      Llama3.2-...    2026.05   4.9
OPSD       Llama3.2-...    2026.05   4.6
SFT        Llama3.2-...    2026.05   4.1