Multi-Task Skill Learning Suite (ScienceQA, MMLU, Hellaswag, HumanEval, TruthfulQA, Winogrande, IFeval)
[Chart: ScienceQA. Highlighted result: SDFT, 70.2, Jan 27, 2026. Selectable metrics: ScienceQA, Hellaswag, HumanEval, IFeval, MMLU, TruthfulQA, Winogrande, Average Score.]
Updated 1mo ago
Evaluation Results
| Method | Date | ScienceQA | Hellaswag | HumanEval | IFeval | MMLU | TruthfulQA | Winogrande | Average Score |
|---|---|---|---|---|---|---|---|---|---|
| SDFT | 2026.01 | 70.2 | 60.9 | 68.9 | 66.8 | 70.7 | 46.5 | 73.1 | 64.5 |
| SFT | 2026.01 | 66.2 | 55 | 54.8 | 35.3 | 64.6 | 36.8 | 73.7 | 53.4 |
| SFT + re-invoke | 2026.01 | 66 | 61.6 | 63.4 | 52.9 | 68.7 | 45.2 | 70 | 60.2 |
| DFT | 2026.01 | 54.8 | 57.6 | 67 | 60.4 | 69.4 | 38.8 | 68.2 | 60.2 |
| Base (Backbone=Qwen2.5-7B) | 2026.01 | 32.1 | 62 | 65.8 | 74.3 | 71.7 | 47.9 | 71.1 | 65.5 |
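The page does not say how Average Score is computed. The listed values appear to match, to within rounding, the mean of the six benchmarks other than ScienceQA. That reading is inferred from the numbers and is not a documented definition. The sketch below recomputes the column under that assumption; the names `prior_skill_average` and `check_averages` are hypothetical helpers introduced for illustration.

```python
from statistics import mean

# Column order of the per-benchmark scores in the results table.
BENCHMARKS = ["ScienceQA", "Hellaswag", "HumanEval", "IFeval",
              "MMLU", "TruthfulQA", "Winogrande"]

# Scores copied from the results table, one row per method.
SCORES = {
    "SDFT":            [70.2, 60.9, 68.9, 66.8, 70.7, 46.5, 73.1],
    "SFT":             [66.2, 55.0, 54.8, 35.3, 64.6, 36.8, 73.7],
    "SFT + re-invoke": [66.0, 61.6, 63.4, 52.9, 68.7, 45.2, 70.0],
    "DFT":             [54.8, 57.6, 67.0, 60.4, 69.4, 38.8, 68.2],
    "Base":            [32.1, 62.0, 65.8, 74.3, 71.7, 47.9, 71.1],
}

# Average Score column as listed on the page.
REPORTED_AVERAGE = {
    "SDFT": 64.5, "SFT": 53.4, "SFT + re-invoke": 60.2,
    "DFT": 60.2, "Base": 65.5,
}


def prior_skill_average(row):
    """Mean over every benchmark except ScienceQA (assumed definition)."""
    return mean(v for name, v in zip(BENCHMARKS, row) if name != "ScienceQA")


def check_averages(tolerance=0.15):
    """Map each method to whether the recomputed mean is within `tolerance`
    of the listed Average Score. The tolerance allows for the table showing
    values rounded to one decimal place."""
    return {
        method: abs(prior_skill_average(row) - REPORTED_AVERAGE[method]) <= tolerance
        for method, row in SCORES.items()
    }


if __name__ == "__main__":
    for method, row in SCORES.items():
        print(f"{method}: recomputed {prior_skill_average(row):.2f}, "
              f"listed {REPORTED_AVERAGE[method]}")
```

If this reading is right, Average Score summarizes the six benchmarks other than ScienceQA, and the ScienceQA column has to be compared separately.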