Large Language Model Evaluation on AlpacaEval, TruthfulQA, and MMLU (test)
[Chart: AlpacaEval, TruthfulQA, MMLU, and Average scores by method over time; best AlpacaEval score 78.1 by Warmup-Stable-Only (WSO), Mar 17, 2026. Updated 1 month ago.]
Evaluation Results
| Method | Configuration | Date | AlpacaEval Score | TruthfulQA Score | MMLU Score | Average Score |
|---|---|---|---|---|---|---|
| Warmup-Stable-Only (WSO) | Model=1B, Scheduler=Wa... | 2026.03 | 78.1 | 38.7 | 34.5 | 50.4 |
| WSD | Model=1B, Scheduler=WS... | 2026.03 | 77.2 | 38.3 | 33.6 | 49.7 |
| Cosine | Model=1B, Scheduler=Co... | 2026.03 | 76.4 | 37.9 | 33.9 | 49.4 |
| WSD | Model=1B, Scheduler=WS... | 2026.03 | 76.0 | 38.4 | 33.7 | 49.4 |
| Cosine | Model=1B, Scheduler=Co... | 2026.03 | 76.0 | 37.9 | 33.9 | 49.3 |
| Linear | Model=1B, Scheduler=Li... | 2026.03 | 75.6 | 37.8 | 34.2 | 49.2 |
| Linear | Model=1B, Scheduler=Li... | 2026.03 | 75.5 | 37.9 | 33.9 | 49.1 |
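The Average Score column appears to be the unweighted arithmetic mean of the three benchmark scores, rounded to one decimal place. This is an assumption inferred from the listed values, not stated on the page; a minimal sketch reproducing the column:

```python
# Assumption: Average Score = mean of the three benchmark scores,
# rounded to one decimal place (inferred from the table, not confirmed).
rows = [
    ("Warmup-Stable-Only (WSO)", 78.1, 38.7, 34.5),
    ("WSD",    77.2, 38.3, 33.6),
    ("Cosine", 76.4, 37.9, 33.9),
    ("WSD",    76.0, 38.4, 33.7),
    ("Cosine", 76.0, 37.9, 33.9),
    ("Linear", 75.6, 37.8, 34.2),
    ("Linear", 75.5, 37.9, 33.9),
]

def average_score(alpaca: float, truthful: float, mmlu: float) -> float:
    """Unweighted mean of the three benchmark scores, one decimal."""
    return round((alpaca + truthful + mmlu) / 3, 1)

for name, a, t, m in rows:
    print(f"{name}: {average_score(a, t, m)}")
```

Under this assumption the computed means match the table's Average Score column for every row, which is consistent with a simple unweighted average despite the large scale difference between AlpacaEval and the other two benchmarks.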