
Transformers Are Born Biased: Structural Inductive Biases at Random Initialization and Their Practical Consequences

About

Transformers underpin modern large language models (LLMs) and are commonly assumed to be behaviorally unstructured at random initialization, with all meaningful preferences emerging only through large-scale training. We challenge this assumption by showing that randomly initialized transformers already exhibit strong and systematic structural biases. In particular, untrained models display extreme token preferences: across random input sequences, certain tokens are predicted with probabilities orders of magnitude larger than others. We provide a mechanistic explanation for this phenomenon by dissecting the transformer architecture at initialization. We show that extreme token preference arises from a contraction of token representations along a random seed-dependent direction. This contraction is driven by two interacting forces: (i) asymmetric nonlinear activations in MLP sublayers induce global (inter-sequence) representation concentration, and (ii) self-attention further amplifies this effect through local (intra-sequence) aggregation. Together, these mechanisms align hidden representations along a direction determined solely by the random initialization, producing highly non-uniform next-token predictions. Beyond mechanistic insight, we demonstrate that these initialization-induced biases persist throughout training, forming a stable and intrinsic model identity. Leveraging this property, we introduce SeedPrint, a fingerprinting method that can reliably distinguish models that differ only in their random initialization, even after extensive training and under substantial distribution shift. Finally, we identify a fundamental positional discrepancy inherent to the attention mechanism's intra-sequence contraction that is causally linked to the attention-sink phenomenon. This discovery provides a principled explanation for the emergence of sinks and offers a pathway for their control.
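The headline claim is easy to probe empirically: feed random token sequences to an untrained transformer and compare the average next-token probabilities across the vocabulary. A minimal sketch in PyTorch, assuming a small bidirectional encoder stack with ReLU MLPs as a stand-in for the architectures studied in the paper (the exact model family and measurement protocol are not specified here):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, d_model, seq_len, n_seqs = 500, 64, 16, 200

# A small randomly initialized transformer LM. The specific sizes and the use
# of nn.TransformerEncoder (no causal mask) are our illustrative assumptions;
# its ReLU MLP sublayers are the kind of asymmetric activation the abstract
# points to as a driver of representation concentration.
embed = nn.Embedding(vocab, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=4, dim_feedforward=128,
                                   batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
lm_head = nn.Linear(d_model, vocab)

with torch.no_grad():
    tokens = torch.randint(0, vocab, (n_seqs, seq_len))    # random input sequences
    hidden = encoder(embed(tokens))                        # untrained forward pass
    probs = torch.softmax(lm_head(hidden[:, -1]), dim=-1)  # next-token distributions
    mean_probs = probs.mean(dim=0)                         # average over random inputs

# Under a uniform-prediction null every token would receive ~1/vocab of the
# mass; at random init the spread between favored and disfavored tokens is
# what the paper calls extreme token preference.
ratio = (mean_probs.max() / mean_probs.min()).item()
print(f"max/min mean next-token probability ratio: {ratio:.1f}")
```

The direction along which representations contract depends only on the seed, so rerunning this with a different `torch.manual_seed` yields a different set of favored tokens.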

Siquan Li, Yao Tong, Haonan Wang, Tianyang Hu • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Model Fingerprinting | LLaMA-2 7B fine-tuned variants | – | – | 5 |
| Model Fingerprint Verification | LLAMA | t-test | 4.00e-10 | 4 |
| Fingerprint persistence | TinyStories cleaned V2 | t-test statistic | 0.00e+0 | 2 |
| Fingerprint persistence | the_stack | t-test p-value | 0.00e+0 | 2 |
| Model Fingerprint Verification | TinyStories (test) | t-test p-value | 8.49e-214 | 2 |
| Model Fingerprint Verification | the_stack (test) | t-test p-value | 1.16e-211 | 2 |
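The t-test entries above suggest that fingerprint verification reduces to a statistical test on token-preference profiles: same-seed models should be far closer to each other than models trained from different seeds. A minimal sketch of that idea, assuming the fingerprint is a per-token bias vector measured with some probe-set noise (the `fingerprint` function is a hypothetical stand-in, not SeedPrint's actual statistic):

```python
import numpy as np
from scipy import stats

vocab = 500
rng = np.random.default_rng(0)

def fingerprint(seed: int, noise: float = 0.05) -> np.ndarray:
    """Hypothetical stand-in for a measured fingerprint: a seed-dependent
    per-token bias profile plus probe-set measurement noise."""
    base = np.random.default_rng(seed).normal(size=vocab)
    return base + noise * rng.normal(size=vocab)

reference = fingerprint(seed=1)   # the model we claim to own
same_seed = fingerprint(seed=1)   # e.g. the same init, re-measured after training
diff_seed = fingerprint(seed=2)   # a model trained from a different init

# Two-sample t-test on the per-token gap magnitudes: same-seed gaps are pure
# measurement noise, different-seed gaps reflect two unrelated bias directions,
# so the test separates the two cases with a vanishingly small p-value.
same_gap = np.abs(reference - same_seed)
diff_gap = np.abs(reference - diff_seed)
t_stat, p_value = stats.ttest_ind(same_gap, diff_gap)
print(f"t = {t_stat:.1f}, p = {p_value:.2e}")
```

Extreme p-values like those in the table (e.g. 8.49e-214) are what this kind of test produces when the gap distributions are cleanly separated over a large vocabulary.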
