Large Language Model Reasoning on CMMLU, GSM8K, and HumanEval (test)
[Chart: Average Accuracy of the best method over time; current best: Fine-tuned, 40.4 (as of May 24, 2025)]
Evaluation Results

| Method          | Backbone | Date    | Average Accuracy |
|-----------------|----------|---------|------------------|
| Fine-tuned      | LLaMa2   | 2025.05 | 40.4             |
| NPS             | LLaMa2   | 2025.05 | 35.3             |
| Consensus TIES  | LLaMa2   | 2025.05 | 34.4             |
| Ties-Merging    | LLaMa2   | 2025.05 | 34.2             |
| Consensus TA    | LLaMa2   | 2025.05 | 33.5             |
| Task Arithmetic | LLaMa2   | 2025.05 | 30.4             |
| Averaging       | LLaMa2   | 2025.05 | 30.3             |
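Several of the baselines above (Averaging, Task Arithmetic, Ties-Merging and their Consensus variants) are weight-space model-merging methods rather than separate fine-tuning runs. A minimal sketch of the two simplest ones, simple averaging and task arithmetic, on toy parameter dictionaries (the parameter names and the scaling coefficient `lam` are illustrative assumptions, not values used on this leaderboard):

```python
# Toy illustration of two weight-merging baselines from the table:
# simple averaging and task arithmetic. Real implementations operate
# on full model state dicts; here each "model" is a dict of floats.

def average_merge(models):
    """Element-wise mean of several fine-tuned checkpoints."""
    keys = models[0].keys()
    return {k: sum(m[k] for m in models) / len(models) for k in keys}

def task_arithmetic_merge(base, models, lam=0.3):
    """Add the scaled sum of task vectors (fine-tuned minus base)
    back onto the base checkpoint."""
    return {
        k: base[k] + lam * sum(m[k] - base[k] for m in models)
        for k in base
    }

base = {"w": 1.0}   # pretrained backbone weight (toy)
ft_a = {"w": 2.0}   # checkpoint fine-tuned on task A
ft_b = {"w": 0.0}   # checkpoint fine-tuned on task B

print(average_merge([ft_a, ft_b])["w"])                # 1.0
print(task_arithmetic_merge(base, [ft_a, ft_b])["w"])  # 1.0
```

Ties-Merging additionally trims small task-vector entries and resolves sign conflicts before summing, which is why it tends to score above plain averaging in the table.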