mHC-lite: You Don't Need 20 Sinkhorn-Knopp Iterations

About

Hyper-Connections (HC) generalizes residual connections by introducing dynamic residual matrices that mix information across multiple residual streams, accelerating convergence in deep neural networks. However, unconstrained residual matrices can compromise training stability. To address this, DeepSeek's Manifold-Constrained Hyper-Connections (mHC) approximately projects these matrices onto the Birkhoff polytope via iterative Sinkhorn–Knopp (SK) normalization. We identify two limitations of this approach: (i) finite SK iterations do not guarantee exact doubly stochasticity, leaving an approximation gap that can accumulate through network depth and undermine stability; (ii) efficient SK implementation requires highly specialized CUDA kernels, raising engineering barriers and reducing portability. Motivated by the Birkhoff–von Neumann theorem, we propose mHC-lite, a simple reparameterization that explicitly constructs doubly stochastic matrices as convex combinations of permutation matrices. This approach guarantees exact doubly stochasticity by construction and can be implemented using only native matrix operations. Extensive experiments demonstrate that mHC-lite matches or exceeds mHC in performance while achieving higher training throughput with a naive implementation and eliminating the residual instabilities observed in both HC and mHC. The code is publicly available at https://github.com/FFTYYY/mhc-lite.

Yongyi Yang, Jianyang Gao • 2026
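For intuition, the reparameterization described in the abstract can be sketched in a few lines of PyTorch: keep one logit per permutation matrix, apply a softmax to obtain convex weights, and sum the permutation matrices. The result is exactly doubly stochastic with no iterative normalization. This is only an illustrative sketch, not the authors' implementation (see the linked repository for that); the module and parameter names below are assumptions, and for simplicity the weights are static rather than input-dependent ("dynamic") as in HC.

```python
# Minimal sketch (not the official mHC-lite code): build an exactly doubly
# stochastic mixing matrix as a learnable convex combination of permutation
# matrices, following the Birkhoff-von Neumann theorem.
import itertools

import torch
import torch.nn as nn


class DoublyStochasticMixer(nn.Module):
    def __init__(self, n_streams: int):
        super().__init__()
        # Enumerate all n! permutation matrices of size n x n. HC-style models
        # use only a handful of residual streams, so n! stays small (4! = 24).
        perms = list(itertools.permutations(range(n_streams)))
        perm_mats = torch.stack(
            [torch.eye(n_streams)[list(p)] for p in perms]
        )  # shape: (n!, n, n)
        self.register_buffer("perm_mats", perm_mats)
        # One learnable logit per permutation; softmax maps them to convex weights.
        self.logits = nn.Parameter(torch.zeros(len(perms)))

    def forward(self) -> torch.Tensor:
        weights = torch.softmax(self.logits, dim=0)  # nonnegative, sums to 1
        # A convex combination of permutation matrices is doubly stochastic
        # by construction; no Sinkhorn-Knopp iterations are needed.
        return torch.einsum("k,kij->ij", weights, self.perm_mats)


mixer = DoublyStochasticMixer(n_streams=4)
H = mixer()
print(H.sum(dim=0))  # all ones: every column sums to 1 exactly
print(H.sum(dim=1))  # all ones: every row sums to 1 exactly
```

Because the number of residual streams is small in practice, enumerating all n! permutations is cheap (n = 4 yields 24 matrices), whereas Sinkhorn–Knopp normalization only approaches double stochasticity as the number of row/column rescaling passes grows, which is the approximation gap the abstract points out.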

Related benchmarks

Task                               | Dataset                            | Result                        | Rank
Language Modeling                  | C4                                 | Perplexity: 98.3              | 1071
Language Modeling                  | OpenWebText (val)                  | Validation Loss: 3.023        | 80
Commonsense Reasoning              | Commonsense Reasoning Suite (test) | HellaSwag Accuracy: 0.352     | 62
Language Modeling                  | WikiText                           | Wikitext PPL: 58              | 45
Language Modeling                  | OpenWebText (train)                | Train Loss: 3.001             | 21
Language Modeling                  | FineWeb-Edu (val)                  | Final Validation Loss: 3.006  | 18
Downstream Performance Evaluation  | CORE                               | CORE Score: 13.217            | 17
Language Modeling                  | FineWeb-Edu (train)                | Loss: 3.013                   | 10
Language Modeling                  | Dolma                              | Perplexity: 223               | 10
Language Modeling                  | Falcon                             | Perplexity: 124.9             | 10

Showing 10 of 17 rows.
