
Large Language Model Evaluation on AlpacaEval, TruthfulQA, and MMLU (test)

Best AlpacaEval score: 78.1, Warmup-Stable-Only (WSO), Mar 17, 2026
Updated 1mo ago

Evaluation Results

Date      AlpacaEval   Other scores
2026.03   78.1         38.7   34.5   50.4
2026.03   77.2         38.3   33.6   49.7
2026.03   76.4         37.9   33.9   49.4
2026.03   76.0         38.4   33.7   49.4
2026.03   76.0         37.9   33.9   49.3
2026.03   75.6         37.8   34.2   49.2
2026.03   75.5         37.9   33.9   49.1