Science Reasoning on GPQA Diamond (Avg@4)
Top result: GRPO + RePro, 44.8 Avg@4 Accuracy (Dec 1, 2025)
Updated 3d ago
Evaluation Results

Method            Backbone            Date      Avg@4 Accuracy
GRPO + RePro      Hunyuan-1.8B-...    2025.12   44.8
GRPO              Hunyuan-1.8B-...    2025.12   43.6
RF++ B + RePro    Hunyuan-1.8B-...    2025.12   43.1
PPO + RePro       Hunyuan-1.8B-...    2025.12   42.7
RF++ B            Hunyuan-1.8B-...    2025.12   42.4
PPO               Hunyuan-1.8B-...    2025.12   42.1
PPO + RePro       Qwen3-1.7B          2025.12   40.3
PPO               Qwen3-1.7B          2025.12   40.2
RF++ B + RePro    Qwen3-1.7B          2025.12   39.8
Original          Qwen3-1.7B          2025.12   39.5
GRPO + RePro      Qwen3-1.7B          2025.12   39.1
RF++ B            Qwen3-1.7B          2025.12   38.5
GRPO              Qwen3-1.7B          2025.12   38.3
Original          Hunyuan-1.8B-...    2025.12   38
PPO + RePro       MobileLLM-R1-...    2025.12   24.1
PPO               MobileLLM-R1-...    2025.12   22.6
Original          MobileLLM-R1-...    2025.12   19.2