Large Language Model Evaluation on GSM8K, TruthfulQA, CommonsenseQA, MMLU, ARC, and TriviaQA
[Chart: Accuracy over time (y-axis: Accuracy, ticks 72.4 to 84.55). Top result: JoBS, 88, Feb 9, 2026.]
Updated 1mo ago
Evaluation Results
| Method | Configuration | Date | Accuracy |
|---|---|---|---|
| JoBS | Model=Qwen3-32B, Fine-... | 2026.02 | 88 |
| LESS + AutoLoRA | Model=Qwen3-32B, Fine-... | 2026.02 | 86 |
| JoBS | Model=Qwen3-14B, Fine-... | 2026.02 | 83 |
| DoReMi + DARTS | Model=Qwen3-32B, Fine-... | 2026.02 | 82 |
| JoBS | Model=Llama-3-8B-Instr... | 2026.02 | 81 |
| DoReMi + DARTS | Model=Qwen3-14B, Fine-... | 2026.02 | 81 |
| DoReMi + DARTS | Model=Llama-3-8B-Instr... | 2026.02 | 76 |
| LESS + AutoLoRA | Model=Qwen3-14B, Fine-... | 2026.02 | 76 |
| LESS + AutoLoRA | Model=Llama-3-8B-Instr... | 2026.02 | 73 |