Large Language Model Reasoning on CMMLU, GSM8K, and HumanEval (test)
[Chart: Average Accuracy of the best method over time; current best: Fine-tuned, 40.4 (as of May 24, 2025)]
Evaluation Results

| Method          | Backbone | Date    | Average Accuracy |
|-----------------|----------|---------|------------------|
| Fine-tuned      | LLaMa2   | 2025.05 | 40.4             |
| NPS             | LLaMa2   | 2025.05 | 35.3             |
| Consensus TIES  | LLaMa2   | 2025.05 | 34.4             |
| Ties-Merging    | LLaMa2   | 2025.05 | 34.2             |
| Consensus TA    | LLaMa2   | 2025.05 | 33.5             |
| Task Arithmetic | LLaMa2   | 2025.05 | 30.4             |
| Averaging       | LLaMa2   | 2025.05 | 30.3             |
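Several of the baselines above (Averaging, Task Arithmetic, Ties-Merging and their Consensus variants) are weight-space model-merging methods rather than separate fine-tuning runs. A minimal sketch of the two simplest ones, simple averaging and task arithmetic, on toy parameter dictionaries (the parameter names and the scaling coefficient `lam` are illustrative assumptions, not values used on this leaderboard):

```python
# Toy illustration of two weight-merging baselines from the table:
# simple averaging and task arithmetic. Real implementations operate
# on full model state dicts; here each "model" is a dict of floats.

def average_merge(models):
    """Element-wise mean of several fine-tuned checkpoints."""
    keys = models[0].keys()
    return {k: sum(m[k] for m in models) / len(models) for k in keys}

def task_arithmetic_merge(base, models, lam=0.3):
    """Add the scaled sum of task vectors (fine-tuned minus base)
    back onto the base checkpoint."""
    return {
        k: base[k] + lam * sum(m[k] - base[k] for m in models)
        for k in base
    }

base = {"w": 1.0}   # pretrained backbone weight (toy)
ft_a = {"w": 2.0}   # checkpoint fine-tuned on task A
ft_b = {"w": 0.0}   # checkpoint fine-tuned on task B

print(average_merge([ft_a, ft_b])["w"])                # 1.0
print(task_arithmetic_merge(base, [ft_a, ft_b])["w"])  # 1.0
```

Ties-Merging additionally trims small task-vector entries and resolves sign conflicts before summing, which is why it tends to score above plain averaging in the table.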