Adaptive Computation Depth via Learned Token Routing in Transformers

About

Standard transformer architectures apply the same number of layers to every token regardless of contextual difficulty. We present Token-Selective Attention (TSA), a learned per-token gate on the residual updates between consecutive transformer blocks. Each gate is a lightweight two-layer multilayer perceptron (MLP) that produces a continuous halting probability, which keeps the mechanism end-to-end differentiable, adds 1.7% parameter overhead, and requires no changes to the base architecture. Notably, TSA learns difficulty-proportional routing without any explicit depth pressure: even at $\lambda=0$ (no depth regularisation), the task-loss gradient alone drives the router to skip 20% of token-layer operations (TLOps). On character-level language modeling, TSA saves 14-23% of TLOps across Tiny-Shakespeare and enwik8 at <0.5% quality loss. At matched efficiency, TSA achieves 0.7% lower validation loss than early exit, and the learned routing transfers directly to inference-time sparse execution for real wall-clock speedup.
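
The gating idea in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The following names and choices are assumptions made only for illustration: `TokenGate`, `gated_residual`, `toy_block`, and `run`; scaling the residual update by one minus the halting probability during training; the ReLU hidden activation; and the 0.5 skip threshold at inference.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, x):
    # Plain matrix-vector product on nested lists.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class TokenGate:
    """Two-layer MLP mapping one token's hidden state to a halting probability."""

    def __init__(self, d_model, d_hidden, rng):
        self.W1 = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
        self.b1 = [0.0] * d_hidden
        self.w2 = [rng.gauss(0, 0.1) for _ in range(d_hidden)]
        self.b2 = 0.0

    def halting_prob(self, h):
        # Hidden layer with ReLU (assumed), then a scalar sigmoid output.
        z = [max(0.0, a + b) for a, b in zip(matvec(self.W1, h), self.b1)]
        return sigmoid(sum(w * zi for w, zi in zip(self.w2, z)) + self.b2)


def gated_residual(h, block, gate, hard=False, threshold=0.5):
    """Apply one gated residual update; return (new state, block_was_run)."""
    p = gate.halting_prob(h)
    if hard and p > threshold:
        # Assumed inference rule: skip the block entirely for this token.
        return list(h), False
    # Soft mode scales the update by (1 - p), which stays differentiable in p.
    keep = 1.0 if hard else 1.0 - p
    update = block(h)
    return [hi + keep * ui for hi, ui in zip(h, update)], True


def toy_block(h):
    # Stand-in for a transformer block's residual branch (hypothetical).
    return [math.tanh(x) for x in h]


def run(tokens, gates, hard):
    """Pass each token through all gated layers; return outputs and TLOps saved."""
    executed = 0
    outputs = []
    for h in tokens:
        for gate in gates:
            h, ran = gated_residual(h, toy_block, gate, hard=hard)
            executed += ran
        outputs.append(h)
    total = len(tokens) * len(gates)
    return outputs, 1.0 - executed / total


if __name__ == "__main__":
    rng = random.Random(0)
    gates = [TokenGate(4, 8, rng) for _ in range(3)]
    tokens = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]
    _, saved = run(tokens, gates, hard=True)
    print(f"fraction of token-layer operations skipped: {saved:.2f}")
```

In soft mode every block still runs, so no token-layer operations are saved; savings appear only when the halting probability is thresholded and the block is skipped, which is one way to read the abstract's "inference-time sparse execution".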

Ahmed Abdelmuniem Abdalla Mohammed • 2026

Related benchmarks

Task	Dataset	Result
Character-level Language Modeling	Enwik8 (val)	BPC 1.8429	23
Character-level Language Modeling	Tiny Shakespeare (val)	Validation Loss 1.4482	19
Sequence Copying	Toy Vocabulary 1K sequences (held-out)	Accuracy 100	2
Sequence Sorting	Toy Vocabulary Sort 1K sequences (held-out)	Accuracy 98.78	2
