Large Language Model Evaluation on GSM8K, TruthfulQA, CommonsenseQA, MMLU, ARC, and TriviaQA

Accuracy: 88

JoBS

Updated 5mo ago

Evaluation Results

Method	Links	Accuracy
JoBS 2026.02		88
LESS + AutoLoRA 2026.02		86
JoBS 2026.02		83
DoReMi + DARTS 2026.02		82
JoBS 2026.02		81
DoReMi + DARTS 2026.02		81
DoReMi + DARTS 2026.02		76
LESS + AutoLoRA 2026.02		76
LESS + AutoLoRA 2026.02		73