Commonsense Reasoning on HellaSwag, Winogrande, and BoolQ (test)
Top result: MSFT, 70.6 Accuracy (Mar 23, 2026)
[Chart: Accuracy over time]
Updated 2mo ago
Evaluation Results
Method          Settings                     Date      Accuracy   Epoch
MSFT            Size=Average, Evaluati...    2026.03   70.6       4.12
IES             Size=Average, Evaluati...    2026.03   69.4       3.88
DynamixSFT      Size=Average, Evaluati...    2026.03   69         4.04
SFT             Size=Average, Evaluati...    2026.03   68.2       3.88
Continual SFT   Size=Average, Evaluati...    2026.03   65.2       1.71
Base            Size=Average, Evaluati...    2026.03   18.4       -