Instruction Following Evaluation on ArenaHard v1
[Chart: ArenaHardv1 Score. Highlighted entry: +RL (Skywork-Reward-V2-Llama-3.1-8B), 38, Jul 2, 2025]
Updated 1mo ago
Evaluation Results
Method                                  Model         Date     ArenaHardv1 Score
+RL (Skywork-Reward-V2-Llama-3.1-8B)    Qwen2.5-7B    2025.07  38
Instruct (official)                     Qwen2.5-7B    2025.07  37.9
+RL (Skywork-Reward-V2-Qwen3-4B)        Qwen2.5-7B    2025.07  35
+RL (Skywork-Reward-Gemma-2-27B-v0.2)   Qwen2.5-7B    2025.07  34.5
+RL (Skywork-Reward-Llama-3-8B-v0.2)    Qwen2.5-7B    2025.07  29.8
Instruct (official)                     Llama-3.1-8B  2025.07  24.9
+SFT                                    Qwen2.5-7B    2025.07  22.1
+RL (Skywork-Reward-V2-Llama-3.1-8B)    Llama-3.1-8B  2025.07  20.8
+RL (Skywork-Reward-V2-Qwen3-4B)        Llama-3.1-8B  2025.07  18.8
Base                                    Qwen2.5-7B    2025.07  16.2
+RL (Skywork-Reward-Gemma-2-27B-v0.2)   Llama-3.1-8B  2025.07  14
+SFT                                    Llama-3.1-8B  2025.07  12.6
+RL (Skywork-Reward-Llama-3-8B-v0.2)    Llama-3.1-8B  2025.07  9.7
Base                                    Llama-3.1-8B  2025.07  6.8