Mixture of Layers with Hybrid Attention
About
Standard Mixture-of-Experts (MoE) transformers route tokens to expert subnetworks within each layer, but the layer structure itself remains monolithic. We introduce Mixture of Layers (MoL), which replaces full-width transformer blocks (d_model) with K parallel thin blocks at reduced dimensionality (d_thin << d_model), connected via learned down/up projections and composed via top-k block routing. Scaling sparse block routing to many blocks creates an attention coverage problem, as each block sees fewer tokens. We address this by introducing hybrid attention, which pairs one shared softmax block for global context with Gated DeltaNet linear attention in routed blocks.
Ivan Ternovtsii, Yurii Bilak • 2026
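
The routing idea in the abstract can be illustrated with a minimal, single-token sketch. This is not the authors' implementation. The names `ThinBlock`, `mol_layer`, and `expected_tokens_per_block` are hypothetical. The residual connection, the softmax renormalization of gates over the selected blocks, and the `tanh` stand-in for each thin block's internals are all assumptions made for illustration. Hybrid attention (the shared softmax block and Gated DeltaNet in the routed blocks) is not modelled here.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; stands in for learned weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class ThinBlock:
    """Hypothetical thin block: down-project d_model -> d_thin,
    apply a stand-in nonlinearity, then up-project d_thin -> d_model.
    A real block would run attention and an MLP at width d_thin."""

    def __init__(self, d_model, d_thin, rng):
        self.down = rand_matrix(d_thin, d_model, rng)
        self.up = rand_matrix(d_model, d_thin, rng)

    def __call__(self, x):
        h = [math.tanh(v) for v in matvec(self.down, x)]  # assumed placeholder
        return matvec(self.up, h)


def mol_layer(x, blocks, router, top_k):
    """Route one token vector x to its top_k blocks and mix their outputs."""
    logits = matvec(router, x)  # one routing score per block
    chosen = sorted(range(len(blocks)), key=lambda i: logits[i], reverse=True)[:top_k]
    gates = softmax([logits[i] for i in chosen])  # assumed: renormalize over chosen
    out = list(x)  # assumed residual connection
    for g, i in zip(gates, chosen):
        y = blocks[i](x)
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, chosen


def expected_tokens_per_block(n_tokens, num_blocks, top_k):
    """Average tokens seen by one block, assuming perfectly balanced routing."""
    return n_tokens * top_k / num_blocks


rng = random.Random(0)
d_model, d_thin, K, top_k = 8, 2, 4, 2
blocks = [ThinBlock(d_model, d_thin, rng) for _ in range(K)]
router = rand_matrix(K, d_model, rng)
x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
out, chosen = mol_layer(x, blocks, router, top_k)
```

Under the balanced-routing assumption, `expected_tokens_per_block` makes the coverage problem concrete: with top-k fixed, the average number of tokens each routed block sees shrinks in proportion to 1/K as blocks are added, which is the motivation the abstract gives for keeping one shared softmax block for global context.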
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | WinoGrande | Accuracy | 54.38 | 1442 |
| LLM Prefill Throughput | LLM Prefill Workload | Prefill Throughput (cuBLASLt) | 3.69e+4 | 114 |
| Question Answering | ARC Challenge | Normalized Accuracy | 26.02 | 105 |
| Question Answering | ARC Easy | Normalized Accuracy | 48.15 | 55 |
| Truthful Question Answering | TruthfulQA MC2 | MC2 Accuracy | 39.55 | 51 |
| Physical Reasoning | PIQA | PIQA Normalized Performance | 64.09 | 12 |
| Decode Latency | Decode Latency Benchmark | Decode Latency (ms/tok) | 62.5 | 12 |
| Language Modeling | Cosmopedia v2 | Final Perplexity (PPL) | 6.49 | 5 |
| Commonsense Reasoning | HellaSwag | Normalized Accuracy | 35.1 | 3 |
| Language Modeling | FineWeb-Edu 20B tokens (val) | Final PPL | 18.04 | 3 |