Large Language Model Reasoning on CMMLU, GSM8K, and HumanEval (test)

Average Accuracy: 40.4

Fine-tuned

[Chart: Average Accuracy over time; y-axis ticks 29.896, 32.623, 35.35, 38.077; x-axis date May 24, 2025]
Updated 1mo ago

Evaluation Results

Method | Links
2025.05    40.4
2025.05    35.3
2025.05    34.4
2025.05    34.2
2025.05    33.5
2025.05    30.4
2025.05    30.3