
Mixture of Layers with Hybrid Attention

About

Standard Mixture-of-Experts (MoE) transformers route tokens to expert subnetworks within each layer, but the layer structure itself remains monolithic. We introduce Mixture of Layers (MoL), which replaces each full-width transformer block (width d_model) with K parallel thin blocks of reduced width (d_thin << d_model), connected via learned down/up projections and composed via top-k block routing. Scaling sparse block routing to many blocks creates an attention coverage problem, because each block sees fewer tokens. We address this with hybrid attention, which pairs one shared softmax-attention block for global context with Gated DeltaNet linear attention in the routed blocks.
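
To make the routing idea concrete, here is a minimal NumPy sketch of one MoL layer under several assumptions that the abstract does not confirm: a linear router, softmax gates renormalized over the selected blocks, a residual connection around the layer, and a tanh transform standing in for whatever each thin block actually computes. The names ThinBlock, mol_layer, and w_router are hypothetical, and the shared softmax block and Gated DeltaNet attention are not modeled. It only illustrates down-projection to d_thin, top-k block selection, and up-projection back to d_model.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class ThinBlock:
    """Hypothetical routed block: down-project, transform at d_thin, up-project."""
    def __init__(self, d_model, d_thin):
        self.down = rng.normal(0.0, 0.02, (d_model, d_thin))
        self.inner = rng.normal(0.0, 0.02, (d_thin, d_thin))
        self.up = rng.normal(0.0, 0.02, (d_thin, d_model))

    def __call__(self, x):
        h = x @ self.down                # (m, d_thin)
        h = np.tanh(h @ self.inner)      # assumption: stand-in for the block body
        return h @ self.up               # (m, d_model)

def mol_layer(x, blocks, w_router, top_k):
    """x has shape (n_tokens, d_model); returns (output, chosen block indices)."""
    logits = x @ w_router                                   # (n_tokens, K)
    top_idx = np.argsort(-logits, axis=-1)[:, :top_k]       # k best blocks per token
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    gates = softmax(top_logits, axis=-1)                    # assumption: renormalized gates
    out = np.zeros_like(x)
    for b, block in enumerate(blocks):
        token_ids, slot = np.nonzero(top_idx == b)          # tokens routed to block b
        if token_ids.size == 0:
            continue
        out[token_ids] += gates[token_ids, slot][:, None] * block(x[token_ids])
    return x + out, top_idx                                 # assumption: residual connection

d_model, d_thin, K, top_k, n_tokens = 64, 16, 8, 2, 10
blocks = [ThinBlock(d_model, d_thin) for _ in range(K)]
w_router = rng.normal(0.0, 0.02, (d_model, K))
x = rng.normal(size=(n_tokens, d_model))

y, chosen = mol_layer(x, blocks, w_router, top_k)
counts = np.bincount(chosen.ravel(), minlength=K)           # tokens seen by each block
print(y.shape, chosen.shape)
print(counts)
```

The per-block counts sum to n_tokens * top_k, so with K = 8 and top_k = 2 each block sees on average a quarter of the tokens. This is the coverage problem the abstract describes: any attention computed inside a routed block only spans the tokens routed to it, which is the gap the shared softmax block is meant to fill.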

Ivan Ternovtsii, Yurii Bilak • 2026

Related benchmarks

Task                         Dataset                       Metric                         Result   Rank
Commonsense Reasoning        WinoGrande                    Accuracy                       54.38    1442
LLM Prefill Throughput       LLM Prefill Workload          Prefill Throughput (cuBLASLt)  3.69e+4  114
Question Answering           ARC Challenge                 Normalized Accuracy            26.02    105
Question Answering           ARC Easy                      Normalized Accuracy            48.15    55
Truthful Question Answering  TruthfulQA MC2                MC2 Accuracy                   39.55    51
Physical Reasoning           PIQA                          PIQA Normalized Performance    64.09    12
Decode Latency               Decode Latency Benchmark      Decode Latency (ms/tok)        62.5     12
Language Modeling            Cosmopedia v2                 Final Perplexity (PPL)         6.49     5
Commonsense Reasoning        HellaSwag                     Normalized Accuracy            35.1     3
Language Modeling            FineWeb-Edu 20B tokens (val)  Final PPL                      18.04    3
Showing 10 of 11 rows
