A Unified Framework for Critical Scaling of Inverse Temperature in Self-Attention

About

Length-dependent logit rescaling is widely used to stabilize long-context self-attention, but existing analyses and methods suggest conflicting inverse-temperature laws for the context length $n$, ranging from $(\log n)^{1/2}$ to $\log n$ and $(\log n)^2$. We provide a general theory showing that the desirable scale is determined by the gap-counting function $N_n$ of each attention row. Counting how many competitors lie within each gap from the maximum, we define an upper-tail accumulation scale and prove that it gives the critical inverse-temperature scale for softmax concentration: below this scale, the top competitors remain unseparated, whereas above it, the attention entropy collapses. This framework unifies prior scaling laws as different $N_n$ and yields a direct diagnostic for attention-score families, from idealized theoretical models to more practical transformers.
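The two quantities the abstract leans on, the number of competitors within a given gap of the row maximum and the entropy of the softmax at inverse temperature $\beta$, can be sketched numerically. This is a minimal illustration, not the authors' method: the function names `gap_count` and `softmax_entropy`, the i.i.d. Gaussian score row, and the chosen values of $\beta$ are all assumptions made for the example, and the paper's upper-tail accumulation scale is not defined in this summary, so it is not implemented here.

```python
import math
import random


def gap_count(scores, delta):
    """Count competitors whose score lies within `delta` of the row maximum.

    One maximizer is excluded; exact ties with the maximum count as competitors.
    (Hypothetical helper; the paper's precise definition of N_n may differ.)
    """
    m = max(scores)
    return sum(1 for s in scores if m - s <= delta) - 1


def softmax_entropy(scores, beta):
    """Shannon entropy (in nats) of softmax(beta * scores) for one attention row."""
    m = max(scores)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(beta * (s - m)) for s in scores]
    z = sum(weights)
    # Terms that underflow to zero contribute nothing to the entropy.
    return -sum((w / z) * math.log(w / z) for w in weights if w > 0.0)


if __name__ == "__main__":
    random.seed(0)
    n = 4096
    # Assumption: an idealized row of i.i.d. standard Gaussian scores.
    row = [random.gauss(0.0, 1.0) for _ in range(n)]

    for delta in (0.1, 0.5, 1.0, 2.0):
        print(f"delta={delta:>4}: competitors within gap = {gap_count(row, delta)}")

    for beta in (0.0, 1.0, math.sqrt(math.log(n)), math.log(n), math.log(n) ** 2):
        print(f"beta={beta:8.3f}: entropy = {softmax_entropy(row, beta):.4f}")
```

At $\beta = 0$ the entropy equals $\log n$, and it can only decrease as $\beta$ grows; how quickly it falls depends on how many competitors crowd the top of the row, which is what the gap counts measure.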

Tomohiro Hayase, Ryo Karakida • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Self-attention inverse temperature scaling analysis | PG19 | -- | 2 |
| Self-attention inverse temperature scaling analysis | Proof-Pile-2 (PP2) | -- | 2 |
| Self-attention inverse temperature scaling analysis | SlimPajama | -- | 2 |
| Self-attention inverse temperature scaling analysis | OpenWebText (OWT) | -- | 2 |