Benchmarks
Language Modeling and Reasoning on ARC, BBH, HellaSwag, TruthfulQA, LAMBADA, WinoGrande, GSM8K, and MT-Bench
[Interactive chart: benchmark scores over time per method, with a selectable metric (ARC, BBH, HellaSwag, TruthfulQA, LAMBADA, WinoGrande, GSM8K, Average Score, MT-Bench). Updated 4d ago.]
Evaluation Results

| Method | Tags | Date | ARC (Acc.) | BBH (Acc.) | HellaSwag (Acc.) | TruthfulQA (Acc.) | LAMBADA (Acc.) | WinoGrande (Acc.) | GSM8K (Acc.) | Average Score | MT-Bench Score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BitDelta | Application=Parameter-... | 2024.02 | 54.61 | 34.28 | 79.1 | 46.6 | 70.58 | 69.3 | 15.16 | 52.8 | 4.87 |
| Llama 2-7B UltraChat | Fine-tuning=r = 16 LoR... | 2024.02 | 54.52 | 34.14 | 78.99 | 46.84 | 70.83 | 69.53 | 14.71 | 52.79 | 4.93 |
| Llama 2-7B | Model Type=Base Model | 2024.02 | 52.56 | 33.76 | 78.96 | 38.96 | 68.39 | 68.98 | 13.57 | 50.74 | - |
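The Average Score column appears to be the unweighted mean of the seven accuracy metrics, with MT-Bench excluded; this reading reproduces the reported averages for all three rows. A minimal sketch (the `rows` dict and its layout are assumptions for illustration, not part of the leaderboard's API):

```python
from statistics import mean

# Seven accuracy metrics per method, in table order:
# ARC, BBH, HellaSwag, TruthfulQA, LAMBADA, WinoGrande, GSM8K
rows = {
    "BitDelta":             [54.61, 34.28, 79.10, 46.60, 70.58, 69.30, 15.16],
    "Llama 2-7B UltraChat": [54.52, 34.14, 78.99, 46.84, 70.83, 69.53, 14.71],
    "Llama 2-7B":           [52.56, 33.76, 78.96, 38.96, 68.39, 68.98, 13.57],
}

for name, scores in rows.items():
    # Unweighted mean over the seven metrics; MT-Bench (a 1-10 scale) is excluded
    print(f"{name}: {mean(scores):.2f}")
# → BitDelta: 52.80, Llama 2-7B UltraChat: 52.79, Llama 2-7B: 50.74
```

The computed means round to 52.80, 52.79, and 50.74, matching the Average Score column, which is consistent with MT-Bench being excluded because its 1-10 scale is not comparable to the percentage-scaled accuracy metrics.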