Commonsense Reasoning on HellaSwag (Accuracy, AVG, Delta)
[Chart: Accuracy by date. Highlighted point: Phi-4 14B (w/o LoopUS), Accuracy 63.1, May 10, 2026. Selectable metrics: Accuracy, Average Score (AVG), Delta Change.]
Updated 21d ago
Evaluation Results
Method                    Details                     Date      Accuracy   Average Score (AVG)   Delta Change
Phi-4 14B (w/o LoopUS)    Model=Phi-4 14B, Setti...   2026.05   63.1       67                    -
Phi-4 14B (w/ LoopUS)     Model=Phi-4 14B, Setti...   2026.05   60.58      68.6                  1.7
Qwen 8B (w/o LoopUS)      Model=Qwen 8B, Setting...   2026.05   57.2       63.2                  -
Qwen 8B (w/ LoopUS)       Model=Qwen 8B, Setting...   2026.05   56         65.4                  2.2
Qwen 4B (w/o LoopUS)      Model=Qwen 4B, Setting...   2026.05   52.1       60.3                  -
Qwen 4B (w/ LoopUS)       Model=Qwen 4B, Setting...   2026.05   51.4       62.1                  1.8
Qwen 1.7B (w/ LoopUS)     Model=Qwen 1.7B, Setti...   2026.05   46.3       55.3                  1.6
Qwen 1.7B (w/o LoopUS)    Model=Qwen 1.7B, Setti...   2026.05   46.2       53.7                  -