Large Language Model Evaluation on GSM8K, TruthfulQA, CommonsenseQA, MMLU, ARC, and TriviaQA
[Chart: Accuracy by date. JoBS: 88 (Feb 9, 2026).]
Evaluation Results
Method            Configuration               Date      Accuracy
JoBS              Model=Qwen3-32B, Fine-...   2026.02   88
LESS + AutoLoRA   Model=Qwen3-32B, Fine-...   2026.02   86
JoBS              Model=Qwen3-14B, Fine-...   2026.02   83
DoReMi + DARTS    Model=Qwen3-32B, Fine-...   2026.02   82
JoBS              Model=Llama-3-8B-Instr...   2026.02   81
DoReMi + DARTS    Model=Qwen3-14B, Fine-...   2026.02   81
DoReMi + DARTS    Model=Llama-3-8B-Instr...   2026.02   76
LESS + AutoLoRA   Model=Qwen3-14B, Fine-...   2026.02   76
LESS + AutoLoRA   Model=Llama-3-8B-Instr...   2026.02   73