Aggregate Performance Evaluation on MMLU, GSM, HellaSwag, TruthfulQA, ARC-C, CodeX
[Chart: Improvement by date. Y-axis: Improvement; x-axis: date (May 29, 2026). Top entry: MADS8B, 5.32.]
Updated 2d ago
Evaluation Results
| Method | Configuration | Date | Improvement |
| --- | --- | --- | --- |
| MADS8B | Base Model=Llama-2-7B,... | 2026.05 | 5.32 |
| MADS3B | Base Model=Llama-2-7B,... | 2026.05 | 5.1 |
| SelectIT | Base Model=Llama-2-7B,... | 2026.05 | 3.15 |
| MADS8B | Base Model=Llama-3-8B,... | 2026.05 | 2.34 |
| NUGGETS | Base Model=Llama-2-7B,... | 2026.05 | 1.51 |
| MADS3B | Base Model=Llama-3-8B,... | 2026.05 | 1.45 |
| ClusterClip | Base Model=Llama-2-7B,... | 2026.05 | 1.29 |
| InsTag | Base Model=Llama-3-8B,... | 2026.05 | 0.96 |
| DEITA | Base Model=Llama-2-7B,... | 2026.05 | 0.91 |
| ClusterClip | Base Model=Llama-3-8B,... | 2026.05 | 0.86 |
| InsTag | Base Model=Llama-2-7B,... | 2026.05 | 0.71 |
| MoDS | Base Model=Llama-2-7B,... | 2026.05 | -0.27 |
| SelectIT | Base Model=Llama-3-8B,... | 2026.05 | -0.44 |
| NUGGETS | Base Model=Llama-3-8B,... | 2026.05 | -1.51 |
| MoDS | Base Model=Llama-3-8B,... | 2026.05 | -1.57 |
| DEITA | Base Model=Llama-3-8B,... | 2026.05 | -2.51 |
| IFD | Base Model=Llama-3-8B,... | 2026.05 | -5.36 |
| IFD | Base Model=Llama-2-7B,... | 2026.05 | -5.67 |