Physical Reasoning on PHYRE DeepPHY (test)

14.92 Att. 1 Score

Qwen-7B^P&I GRPO

Mar 1, 2026
Updated 1mo ago

Evaluation Results

MethodLinks
2026.03
14.9245.56
2026.03
14.6343.17
2026.03
14.1139.49
2026.03
1434.5
2026.03
13.1843.66
2026.03
1340
2026.03
10.733.56
2026.03
10.4229
2026.03
9.8532
2026.03
9.529.03
2026.03
7.9544.85
2026.03
7.542.67
2026.03
3.6736.13
2026.03
3.438.67
2026.03
3.0330.77
2026.03
2.4314.92
2026.03
2.1722.07
2026.03
2.1712.33
2026.03
1.7810.39
2026.03
1.7310.63
2026.03
1.5228.69
2026.03
1.510.8
2026.03
1.3325.58
2026.03
1.117.44
2026.03
0.8213.45
2026.03
0.7411.88
2026.03
0.6710.1
2026.03
0.679.33
2026.03
0.678.33
2026.03
0.6713.04
2026.03
0.6613.48
2026.03
0.339.83
2026.03
0.175.85
2026.03
0.1725.6
2026.03
0.179.67
2026.03
0.038.7