Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models

About

Extreme activation outliers in Large Language Models (LLMs) critically degrade quantization performance, hindering efficient on-device deployment. While channel-wise operations and adaptive gradient scaling are recognized causes, practical mitigation remains challenging. We introduce Outlier-Safe Pre-Training (OSP), a practical guideline that proactively prevents outlier formation rather than relying on post-hoc mitigation. OSP combines three key innovations: (1) the Muon optimizer, eliminating privileged bases while maintaining training efficiency; (2) Single-Scale RMSNorm, preventing channel-wise amplification; and (3) a learnable embedding projection, redistributing activation magnitudes originating from embedding matrices. We validate OSP by training a 1.4B-parameter model on 1 trillion tokens, which is the first production-scale LLM trained without such outliers. Under aggressive 4-bit quantization, our OSP model achieves a 35.7 average score across 10 benchmarks (compared to 26.5 for an Adam-trained model), with only a 2% training overhead. Remarkably, OSP models exhibit near-zero excess kurtosis (0.04) compared to extreme values (1818.56) in standard models, fundamentally altering LLM quantization behavior. Our work demonstrates that outliers are not inherent to LLMs but are consequences of training strategies, paving the way for more efficient LLM deployment. The source code and pretrained checkpoints are available at https://github.com/dmis-lab/Outlier-Safe-Pre-Training.
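The link the abstract draws between excess kurtosis and 4-bit quantization quality can be illustrated with a small sketch. This is not the authors' evaluation code. It assumes a population-moment definition of excess kurtosis (0 for a Gaussian) and a simple symmetric per-tensor absmax 4-bit quantizer, and it uses synthetic Gaussian "activations" with one hypothetical extreme value injected. All function and variable names are illustrative.

```python
import random


def excess_kurtosis(xs):
    """Population excess kurtosis: E[(x - mean)^4] / var^2 - 3 (0 for a Gaussian)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / (var ** 2) - 3.0


def quantize_int4_absmax(xs):
    """Assumed quantizer: symmetric round-to-nearest onto integers in [-7, 7]
    with a single scale set by the largest magnitude in the tensor."""
    scale = max(abs(x) for x in xs) / 7.0
    return [max(-7, min(7, round(x / scale))) * scale for x in xs]


def mean_squared_error(xs, ys):
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)
clean = [rng.gauss(0.0, 1.0) for _ in range(4096)]  # synthetic activations
with_outlier = list(clean)
with_outlier[0] = 200.0  # hypothetical single extreme activation

results = {}
for name, acts in [("clean", clean), ("with_outlier", with_outlier)]:
    dequantized = quantize_int4_absmax(acts)
    # Measure the error on the ordinary entries only (index 0 is skipped),
    # so the comparison shows what the outlier does to everything else.
    err = mean_squared_error(acts[1:], dequantized[1:])
    results[name] = (excess_kurtosis(acts), err)
    print(name, "excess kurtosis:", results[name][0], "int4 MSE:", results[name][1])
```

In this toy setting a single large value stretches the quantization scale so far that most ordinary entries round to zero, which is the failure mode that outlier-free training is meant to avoid.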

Jungwoo Park, Taewhoo Lee, Chanwoong Yoon, Hyeon Hwang, Jaewoo Kang • 2025

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	GSM8K	Accuracy 10.5	1398
Multi-task Language Understanding	MMLU	Accuracy 38.5	881
Question Answering	OpenBookQA	Accuracy 40.4	465
Reasoning	ARC	Accuracy 57.5	245
Reasoning	HellaSwag (HS)	HellaSwag Accuracy 61.3	209
Reasoning	WinoGrande (WG)	Accuracy 55.8	168
Reasoning	PIQA	Accuracy 75.5	164
Question Answering	CommonsenseQA (CSQA)	Accuracy 37.6	124
Reasoning	SIQA	Accuracy 44.4	44
Trivia QA	Trivia QA	Accuracy 22.4	32

Showing 10 of 11 rows
