
Beyond the Birkhoff Polytope: Spectral-Sphere-Constrained Hyper-Connections

About

Hyper-Connections (HC) generalize residual connections into multiple streams, employing residual matrices for cross-stream feature mixing to enrich model expressivity. However, unconstrained mixing disrupts the identity mapping property intrinsic to the residual connection, causing unstable training. To address this, Manifold-Constrained Hyper-Connections (mHC) and its variant restrict these matrices to the Birkhoff polytope (doubly stochastic matrices) via Sinkhorn iterations or permutation-based parameterizations. We reveal three limitations of this polytope constraint: (1) identity degeneration, where learned matrices collapse around the identity and diminish cross-stream interactions, (2) an expressivity bottleneck, as the non-negativity constraint prevents subtractive feature disentanglement, and (3) parameterization inefficiencies, manifesting as unstable Sinkhorn iterations or the factorial-scaling overhead of permutation-based parameterizations. To overcome these flaws, we propose Spectral-Sphere-Constrained Hyper-Connections (sHC). By geometrically shifting the feasible set from a rigid polytope to a spectral norm sphere, sHC allows negative entries, unlocking subtractive interactions for selective feature diversification. This shift eliminates unstable Sinkhorn projections and factorial parameterization, enabling expressive, non-degenerate residual matrices while preserving training stability.
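The contrast between the two feasible sets can be made concrete with a small NumPy sketch. It is an illustration under stated assumptions, not the authors' parameterization: the helper names `sinkhorn` and `spectral_normalize`, the 4x4 size, the iteration count, and the unit radius are all hypothetical choices. The first helper approximates a doubly stochastic matrix by alternating row and column normalization; the second rescales a matrix so its largest singular value equals the chosen radius.

```python
import numpy as np


def sinkhorn(logits, n_iters=50):
    """Approximate a doubly stochastic matrix by alternating
    row and column normalization of exp(logits)."""
    m = np.exp(logits - logits.max())  # strictly positive entries
    for _ in range(n_iters):
        m = m / m.sum(axis=1, keepdims=True)  # rows sum to 1
        m = m / m.sum(axis=0, keepdims=True)  # columns sum to 1
    return m


def spectral_normalize(w, radius=1.0):
    """Rescale w so its largest singular value equals `radius`.
    This is an assumed, simplest-possible way to land on a
    spectral-norm sphere; the paper may use something else."""
    sigma = np.linalg.norm(w, ord=2)  # largest singular value
    return w * (radius / sigma)


rng = np.random.default_rng(0)
raw = rng.standard_normal((4, 4))  # hypothetical 4-stream residual matrix

birkhoff = sinkhorn(raw)            # non-negative, rows/columns sum to ~1
sphere = spectral_normalize(raw)    # signed entries, spectral norm 1

print("Birkhoff min entry:", birkhoff.min())
print("Sphere min entry:  ", sphere.min())
print("Sphere spectral norm:", np.linalg.norm(sphere, ord=2))
```

In this sketch the Sinkhorn output can never contain a negative entry, while the spectrally normalized matrix keeps the signs of the raw parameters, which is the kind of subtractive mixing the abstract refers to. A doubly stochastic matrix also has spectral norm exactly 1, so both constructions bound the norm of the mixing matrix in the same way.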

Zhaoyi Liu, Haichuan Zhang, Ang Li • 2026

Related benchmarks

Task               Dataset               Metric                  Result    Rank
Language Modeling  C4                    Perplexity              97.2      1071
Language Modeling  OpenWebText (val)     Validation Loss         3.012     80
Language Modeling  WikiText              Wikitext PPL            57        45
Language Modeling  OpenWebText (train)   Train Loss              2.998     21
Language Modeling  FineWeb-Edu (val)     Final Validation Loss   3.003     18
Language Modeling  FineWeb-Edu (train)   Loss                    2.993     10
Language Modeling  Dolma                 Perplexity              218.8     10
Language Modeling  Falcon                Perplexity              123.3     10
Language Modeling  RedPajama             Perplexity              1.13e+3   10
