Mixture of Layers with Hybrid Attention

About

Standard Mixture-of-Experts (MoE) transformers route tokens to expert subnetworks within each layer, but the layer structure itself remains monolithic. We introduce Mixture of Layers (MoL), which replaces full-width transformer blocks (d_model) with K parallel thin blocks at reduced dimensionality (d_thin << d_model), connected via learned down/up projections and composed via top-k block routing. Scaling sparse block routing to many blocks creates an attention coverage problem, as each block sees fewer tokens. We address this by introducing hybrid attention, which pairs one shared softmax block for global context with Gated DeltaNet linear attention in routed blocks.
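The block structure described above can be illustrated with a minimal pure-Python sketch for a single token. It is not the authors' implementation: the class name `MoLLayer`, the helper `top_k_route`, the residual connection, the softmax taken over only the selected router logits, and the `tanh` layer standing in for a thin transformer block are all assumptions made for illustration. Only the down/up projections, the K parallel thin blocks, and top-k block routing come from the abstract. Hybrid attention is not modeled here.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix standing in for learned weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def top_k_route(logits, k):
    """Pick the k largest logits; softmax-normalise over those k only (assumed)."""
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in idx)
    exps = [math.exp(logits[i] - m) for i in idx]
    z = sum(exps)
    return idx, [e / z for e in exps]

class MoLLayer:
    """Hypothetical Mixture-of-Layers layer: K thin blocks, top-k routed per token."""

    def __init__(self, d_model, d_thin, num_blocks, k, seed=0):
        rng = random.Random(seed)
        self.k = k
        self.router = rand_matrix(num_blocks, d_model, rng)
        # One down projection, thin "block", and up projection per routed block.
        self.down = [rand_matrix(d_thin, d_model, rng) for _ in range(num_blocks)]
        self.inner = [rand_matrix(d_thin, d_thin, rng) for _ in range(num_blocks)]
        self.up = [rand_matrix(d_model, d_thin, rng) for _ in range(num_blocks)]

    def forward(self, x):
        idx, weights = top_k_route(matvec(self.router, x), self.k)
        out = list(x)  # residual connection (an assumption, not stated in the abstract)
        for b, w in zip(idx, weights):
            h = matvec(self.down[b], x)                            # d_model -> d_thin
            h = [math.tanh(v) for v in matvec(self.inner[b], h)]   # placeholder thin block
            y = matvec(self.up[b], h)                              # d_thin -> d_model
            out = [o + w * v for o, v in zip(out, y)]
        return out, idx, weights

# Usage: route one 8-dimensional token through 2 of 4 thin blocks of width 2.
layer = MoLLayer(d_model=8, d_thin=2, num_blocks=4, k=2)
y, chosen, weights = layer.forward([0.5] * 8)
```

Because only k of the K blocks run for a given token, per-token compute in this sketch grows with k and d_thin rather than with K, which is the usual motivation for sparse routing.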

Ivan Ternovtsii, Yurii Bilak • 2026

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled in source)
Commonsense Reasoning	WinoGrande	Accuracy	54.38	1581
LLM Prefill Throughput	LLM Prefill Workload	Prefill Throughput (cuBLASLt)	3.69e+4	114
Question Answering	ARC Challenge	Normalized Accuracy	26.02	105
Commonsense Reasoning	HellaSwag	Normalized Accuracy	35.1	66
Truthful Question Answering	TruthfulQA MC2	MC2 Accuracy	39.55	66
Question Answering	ARC Easy	Normalized Accuracy	48.15	55
Physical Reasoning	PIQA	PIQA Normalized Performance	64.09	12
Decode Latency	Decode Latency Benchmark	Decode Latency (ms/tok)	62.5	12
Language Modeling	Cosmopedia v2	Final Perplexity (PPL)	6.49	5
Language Modeling	FineWeb-Edu 20B tokens (val)	Final PPL	18.04	3

