Multi-Task Reasoning on GPQA Diamond
[Chart: Pass@1 over time, Jun 10, 2025 to Mar 17, 2026. Top result: 55 Pass@1 (Standard RLVR).]
Updated 2mo ago
Evaluation Results
Method                  Configuration              Date     Pass@1
Standard RLVR           Backbone=Qwen3-14B-Base    2026.02  55
Composition-RL          Backbone=Qwen3-30B-A3B...  2026.02  54.6
Composition-RL          Backbone=Qwen3-14B-Base    2026.02  54.2
Standard RLVR           Backbone=Qwen3-30B-A3B...  2026.02  50.7
Composition-RL          Backbone=Qwen3-8B-Base     2026.02  48.9
Composition-RL          Backbone=Qwen3-4B-Base...  2026.02  48.5
Composition-RL          Backbone=Qwen3-4B-Base...  2026.02  48.3
Standard RLVR           Backbone=Qwen3-8B-Base     2026.02  48.2
Composition-RL          Backbone=Qwen3-4B-Base     2026.02  46.3
RULEREASONER-8B + DADS  Backbone=Qwen3-8B, Alg...  2025.06  44.9
Standard RLVR           Backbone=Qwen3-4B-Base     2026.02  43.7
DCRL                    Backbone=Qwen3-8B-Base     2026.03  37.9
DCRL                    Backbone=Qwen3-4B-Base     2026.03  34.3
DCRL                    Backbone=Llama3.2-3B-I...  2026.03  24.7
Qwen3-8B-Base           Backbone=Qwen3-8B          2025.06  16.6