General Evaluation on Aggregate Benchmarks
Best Average Score: 69.26 (GOLF), as of Mar 4, 2026
Evaluation Results
| Method | Model Backbone | Date | Average Score |
|---|---|---|---|
| GOLF | Qwen-3-8B | 2026.03 | 69.26 |
| Rubric-as-Reward | Qwen-3-8B | 2026.03 | 67.08 |
| Pairwise-GRPO | Qwen-3-8B | 2026.03 | 66.97 |
| Critique-GRPO | Qwen-3-8B | 2026.03 | 66.96 |
| Direct-Likert | Qwen-3-8B | 2026.03 | 62.99 |
| Qwen-3-8B | Qwen-3-8B | 2026.03 | 53.95 |
| GOLF | Llama-3... | 2026.03 | 50.19 |
| Critique-GRPO | Llama-3... | 2026.03 | 40.92 |
| Rubric-as-Reward | Llama-3... | 2026.03 | 40.11 |
| Pairwise-GRPO | Llama-3... | 2026.03 | 39.94 |
| Direct-Likert | Llama-3... | 2026.03 | 35.79 |
| Llama-3.1-8B-Instruct | Llama-3... | 2026.03 | 24.3 |