Language Modeling on Pre-training corpus (train)

15.71 Perplexity

Pre-LN + LayerNorm Scaling

Evaluation Results

Method (date)	Perplexity	(unnamed metric)
Pre-LN + LayerNorm Scaling 2025.02	15.71	-
Pre-LN 2025.02	17.02	-
Pre-LN + LayerNorm Scaling 2025.02	18.2	-
Pre-LN 2025.02	19.58	-
Pre-LN + LayerNorm Scaling 2025.02	20.35	-
Mix-LN 2025.02	21.39	-
Pre-LN 2025.02	21.92	-
DeepNorm 2025.02	22.77	-
Pre-LN + LayerNorm Scaling 2025.02	25.76	-
Mix-LN 2025.02	26.07	-
Pre-LN 2025.02	26.73	-
Post-LN 2025.02	26.95	-
DeepNorm 2025.02	27.17	-
DeepNorm 2025.02	1,362.59	-
Mix-LN 2025.02	1,363.21	-
Post-LN 2025.02	1,368.33	-
Post-LN 2025.02	1,390.75	-
DeepNorm 2025.02	1,409.08	-
Post-LN 2025.02	1,409.79	-
Mix-LN 2025.02	1,414.78	-
Mistral (Full-Attention) 2024.07	-	2.56
Mamba (SSM) 2024.07	-	2.62
Hybrid (Sliding Attention + SSM) 2024.07	-	2.69
BMoJo (Fading) 2024.07	-	2.68
BMoJo (Fading + Eidetic) 2024.07	-	2.67
Mistral (Full-Attention) 2024.07	-	2.27
Mamba (SSM) 2024.07	-	2.37
Hybrid (Sliding Attention + SSM) 2024.07	-	2.42
BMoJo (Fading) 2024.07	-	2.27
BMoJo (Fading + Eidetic) 2024.07	-	2.26