General Capability Evaluation on Tülu General Benchmarks 3
[Chart: MMLU by method. Highlighted entry: System Prompt, 45 (Mar 4, 2026). Metrics available: MMLU, GSM8K, BBH, CodexEval.]
Updated 1mo ago
Evaluation Results
Method             Method details             Date     MMLU  GSM8K  BBH  CodexEval
-----------------  -------------------------  -------  ----  -----  ---  ---------
System Prompt      decoding_settings=gree...  2026.03  45    52     27   34
Llama-3.2-1B-SFT   decoding_settings=gree...  2026.03  42    50     25   24
VCL                decoding_settings=gree...  2026.03  42    51     26   25
DRO                decoding_settings=gree...  2026.03  39    49     24   23
Self-CD            decoding_settings=gree...  2026.03  38    50     26   23
Vector Ablation    decoding_settings=gree...  2026.03  37    51     25   24