General Multitask Evaluation on Math500, GPQA, HumanEval, MBPP, AE2 LC Aggregate
[Chart: Average Score over time. Top entry: Llama3.2-3B-GRLO+RLVR, Average Score 40.7, May 14, 2026.]
Evaluation Results
Method                   Details                     Date      Average Score
Llama3.2-3B-GRLO+RLVR    Backbone=Llama3.2-3B,...    2026.05   40.7
Llama3.2-3B-GRLO         Backbone=Llama3.2-3B,...    2026.05   39.3
Llama3.2-3B-Instruct     Backbone=Llama3.2-3B,...    2026.05   35.6
Llama3.2-3B-RLVR         Backbone=Llama3.2-3B,...    2026.05   30.7
Llama3.2-3B-SFT          Backbone=Llama3.2-3B,...    2026.05   21.5