Suppressing Final Layer Hidden State Jumps in Transformer Pretraining
About
This paper examines the internal behavior of Transformer language models. Many recent pre-trained models have been reported to exhibit only slight changes in the angular distance between input and output hidden-state vectors across the middle Transformer layers, while a disproportionately large "jump" in angular distance occurs in or around the final layer. To characterize this, we first introduce a quantitative metric for the jump strength around the final layer, then demonstrate its prevalence across many open-weight models and show that it is amplified over the course of pre-training. On the assumption that such jumps indicate an undesirable property, we propose the jump-suppressing regularizer (JREG), which penalizes this jump during pre-training and thereby encourages more balanced capability usage across the middle layers. Empirical evaluations on Llama-based models at three sizes show that training with JREG improves task performance over the baseline without altering the model architecture.
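The abstract does not give the exact definition of the jump-strength metric, but a plausible sketch is the per-layer angular distance between consecutive hidden states, with the final layer's change compared against the mean change over the middle layers. The function names and the ratio-based normalization below are assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

def angular_distance(u, v):
    # Angular distance between two hidden-state vectors:
    # arccos of cosine similarity, normalized to [0, 1].
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi

def layerwise_jumps(hidden_states):
    # hidden_states: list of per-layer hidden vectors, ordered from
    # the embedding output through the final Transformer layer.
    return [angular_distance(hidden_states[i], hidden_states[i + 1])
            for i in range(len(hidden_states) - 1)]

def final_layer_jump_strength(hidden_states):
    # Hypothetical metric: ratio of the final layer's angular change
    # to the mean change across the preceding (middle) layers.
    # A value well above 1 would indicate the "jump" described above.
    jumps = layerwise_jumps(hidden_states)
    middle_mean = float(np.mean(jumps[:-1]))
    return jumps[-1] / max(middle_mean, 1e-8)
```

Under this sketch, a JREG-style penalty could simply add the final-layer angular distance (or the ratio above) to the pre-training loss, so that gradient descent discourages abrupt last-layer rotations.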
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Instruction Following | MT-Bench | MT-Bench Score: 3.4 | 189 |
| Instruction Following | Vicuna-bench | Score: 6.36 | 13 |
| Instruction Following | WizardLM (test) | Score: 4.23 | 13 |
| Zero-shot Downstream Task Evaluation | ARC-e, BoolQ, HellaSwag, LAMBADA, PIQA, RACE, SocialIQA, SciQ, SWAG | ARC-e Accuracy: 77.9 | 12 |