General Language Model Capability on MMLU, GSM8K, HumanEval, and BBH Aggregate
[Chart: Average Score by date. Best result: 68.42 (VAR), Feb 16, 2025. Updated 26d ago.]
Evaluation Results
Method  Base Model   Date     Average Score
VAR     Qwen2.5-7B   2025.02  68.42
ALoL    Qwen2.5-7B   2025.02  62.4
Base    Qwen2.5-7B   2025.02  61.8
DPO     Qwen2.5-7B   2025.02  57.44
VAR     Llama2-7B    2025.02  24.2
ALoL    Llama2-7B    2025.02  22.71
DPO     Llama2-7B    2025.02  20.86
Base    Llama2-7B    2025.02  13.79