Chat Performance on Arena-Hard-Auto
[Chart: Score by date. Highlighted point: GR3, Score 92.8, Mar 11, 2026. Selectable metrics: Score, Token Count.]
Updated 1mo ago
Evaluation Results
Method   Model Backbone  Date     Score  Token Count
GR3      Qwen3-8B        2026.03  92.8   1,178
GRPO     Qwen3-8B        2026.03  90.6   2,343
GR3      Qwen3-4B        2026.03  85.9   1,377
GRPO     Qwen3-4B        2026.03  85.8   2,374
Initial  Qwen3-8B        2026.03  77.2   1,171
Initial  Qwen3-4B        2026.03  66.6   1,139