STEM Reasoning on GPQA Diamond (Accuracy avg@4)
[Chart: Accuracy (avg@4) by date. Highlighted entry: PAPO, 55, Mar 27, 2026.]
Evaluation Results
| Method | Model | Date | Accuracy (avg@4) |
|---|---|---|---|
| PAPO | Qwen2.5-14B | 2026.03 | 55 |
| ORM(GRPO) | Qwen2.5-14B | 2026.03 | 47 |
| PAPO | Qwen3-4B-Base | 2026.03 | 43 |
| PAPO | Qwen2.5-7B | 2026.03 | 42.4 |
| ORM(GRPO) | Qwen2.5-7B | 2026.03 | 40.7 |
| ORM(DAPO) | Qwen3-4B-Base | 2026.03 | 39 |
| ORM(GRPO) | Qwen2.5-3B | 2026.03 | 38.8 |
| PAPO | Qwen2.5-3B | 2026.03 | 36.3 |
| Base | Qwen3-4B-Base | 2026.03 | 34.9 |
| Base | Qwen2.5-14B | 2026.03 | 34.4 |
| Base | Qwen2.5-7B | 2026.03 | 30.1 |
| Base | Qwen2.5-3B | 2026.03 | 27.6 |
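
The page does not define the "avg@4" metric. A common reading is accuracy averaged over 4 independently sampled answers per question. The sketch below illustrates that reading only. The function name `avg_at_k` and the toy grading matrix are hypothetical and are not taken from the benchmark's evaluation code or data.

```python
def avg_at_k(correct, k=4):
    """Return accuracy in percent, averaged over k samples per question.

    `correct` holds one list per question, each with k booleans saying
    whether that sampled answer was graded correct. This assumes avg@k
    means the mean accuracy over k samples; the source page does not
    confirm the exact definition.
    """
    if not correct:
        raise ValueError("need at least one question")
    for row in correct:
        if len(row) != k:
            raise ValueError(f"expected {k} samples per question, got {len(row)}")
    # With exactly k samples per question, the mean of the k per-run
    # accuracies equals the overall fraction of correct samples.
    total_correct = sum(sum(1 for ok in row if ok) for row in correct)
    return 100.0 * total_correct / (len(correct) * k)


# Illustrative toy grades for 3 questions x 4 samples (not benchmark data).
toy_grades = [
    [True, True, False, True],
    [False, False, False, False],
    [True, False, True, True],
]
toy_score = avg_at_k(toy_grades)
```

Under this reading, a question answered correctly in only some of its 4 samples contributes partial credit, which makes the score less sensitive to sampling noise than a single run.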