
Bounded Hyperbolic Tangent: A Stable and Efficient Alternative to Pre-Layer Normalization in Large Language Models

About

Pre-Layer Normalization (Pre-LN) is the de facto choice for large language models (LLMs) and is crucial for stable pretraining and effective transfer learning. However, Pre-LN is inefficient due to repeated statistical calculations and suffers from the curse of depth: as layers grow, the magnitude and variance of the hidden state escalate, destabilizing training. Efficiency-oriented normalization-free methods such as Dynamic Tanh (DyT) improve speed but remain fragile at depth. To jointly address stability and efficiency, we propose Bounded Hyperbolic Tangent (BHyT), a drop-in replacement for Pre-LN. BHyT couples a tanh nonlinearity with explicit, data-driven input bounding to keep activations within a non-saturating range. It prevents depth-wise growth in activation magnitude and variance and comes with a theoretical stability guarantee. For efficiency, BHyT computes exact statistics once per block and replaces a second normalization with a lightweight variance approximation. Empirically, BHyT demonstrates improved stability and efficiency during pretraining, achieving on average 15.8% faster training and 4.2% higher token generation throughput than RMSNorm, while matching or surpassing its inference performance and robustness across language understanding and reasoning benchmarks. Our code is available at: https://anonymous.4open.science/r/BHyT
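The core idea described above can be sketched in a few lines: scale the hidden state by a data-driven bound so that tanh operates in its non-saturating range. The sketch below is a minimal NumPy illustration of that idea, not the authors' implementation; the choice of the RMS of the activations as the "bound" and the scale parameter `alpha` are assumptions for illustration.

```python
import numpy as np

def bhyt_sketch(x, alpha=1.0, eps=1e-6):
    """Illustrative bounded-tanh step (NOT the paper's exact method).

    Divides the input by a data-driven bound (here assumed to be the
    RMS over the feature dimension) so that tanh stays away from its
    saturated region, then applies tanh. The output is always in
    (-1, 1), so activation magnitude cannot grow with depth.
    """
    # Data-driven input bound: RMS over the last axis (assumption)
    bound = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return np.tanh(alpha * x / bound)

# Even with large-magnitude activations, outputs remain bounded.
x = 10.0 * np.random.randn(4, 8)
y = bhyt_sketch(x)
```

Because the bound tracks the scale of the incoming activations, the argument to tanh stays O(1) regardless of how much the hidden state has grown, which is the mechanism the abstract credits for preventing depth-wise escalation of magnitude and variance.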

Hoyoon Byun, Youngjun Choi, Taero Kim, Sungrae Park, Kyungwoo Song • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Question Answering and Reasoning | Downstream Reasoning Suite (ARC-e, PIQA, HellaSwag, OpenBookQA, WinoGrande, MMLU, BoolQ) | ARC-e: 49.23 | 14 |
| Language Modeling | Pretraining Dataset | Train Loss (PT): 3.133 | 10 |
| Language Modeling and Zero-shot Reasoning | Standard LLM Evaluation Suite (ARC-e, PIQA, HellaSwag, OpenBookQA, WinoGrande, MMLU, BoolQ) | PT Eval Loss: 3.254 | 5 |
| Pre-training | Pre-training (evaluation) | Pre-training Eval Loss: 3.254 | 5 |
| Supervised Fine-tuning | SFT (evaluation) | SFT Evaluation Loss: 3.13 | 5 |
| Supervised Fine-tuning | SFT (train) | SFT Train Loss: 2.693 | 5 |
| Zero-shot Evaluation | Zero-shot Downstream Tasks (ARC-e, PIQA, HellaSwag, OpenBookQA, WinoGrande, MMLU, BoolQ), Llama-1B Benchmark Suite (test) | ARC-e Accuracy: 30.5 | 5 |
| Zero-shot Downstream Task Evaluation | Downstream Evaluation Suite (ARC-e, PIQA, HellaSwag, OpenBookQA, WinoGrande, MMLU, BoolQ) | ARC-e: 53.83 | 4 |
| Language Modeling | 20B-token pretraining corpus | PT Train Loss: 2.756 | 2 |
| Supervised Fine-tuning | Supervised Fine-Tuning (SFT) | SFT Training Loss: 2.468 | 2 |
