General Chat on WildBench
Top LLM Judge Score: 68.16 (GOLF, Mar 4, 2026)
Evaluation Results
| Method | Model Backbone | Date | LLM Judge Score |
|---|---|---|---|
| GOLF | Qwen-3-8B | 2026.03 | 68.16 |
| Pairwise-GRPO | Qwen-3-8B | 2026.03 | 67.77 |
| Rubric-as-Reward | Qwen-3-8B | 2026.03 | 67.09 |
| Critique-GRPO | Qwen-3-8B | 2026.03 | 64.84 |
| Direct-Likert | Qwen-3-8B | 2026.03 | 58.01 |
| Qwen-3-8B | Qwen-3-8B | 2026.03 | 48.05 |
| GOLF | Llama-3... | 2026.03 | 34.42 |
| Rubric-as-Reward | Llama-3... | 2026.03 | 26.51 |
| Pairwise-GRPO | Llama-3... | 2026.03 | 25.54 |
| Critique-GRPO | Llama-3... | 2026.03 | 25.09 |
| Direct-Likert | Llama-3... | 2026.03 | 13.48 |
| Llama-3.1-8B-Instruct | Llama-3... | 2026.03 | -8.25 |