
Held-out capability

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Capability Retention | Held-out capability AIME-2024, MATH-500, GSM8K, HumanEval, MMLU (test) | AIME-2024 Score: 63.5 | 39 |
| General capability retention | Held-out capability average | Unweighted Average Accuracy: 82.6 | 21 |
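As a rough illustration of what an "Unweighted Average Accuracy" metric usually means, the sketch below averages per-benchmark scores so that each benchmark counts once, regardless of how many examples it contains. The function name `unweighted_average` and the scores are hypothetical placeholders, not values from this leaderboard, and the page does not confirm which benchmarks enter the reported average.

```python
from statistics import fmean


def unweighted_average(scores: dict[str, float]) -> float:
    """Return the plain mean of per-benchmark scores.

    Every benchmark contributes equally, no matter how many test
    examples it has (that is what "unweighted" means here).
    """
    if not scores:
        raise ValueError("no scores given")
    return fmean(scores.values())


# Hypothetical placeholder scores, NOT results reported on this page.
example_scores = {
    "benchmark_a": 60.0,
    "benchmark_b": 90.0,
    "benchmark_c": 75.0,
}

print(unweighted_average(example_scores))  # 75.0
```

A weighted variant would instead multiply each score by its benchmark's example count before dividing, which lets large benchmarks dominate the average.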