
Adaptive Computation Depth via Learned Token Routing in Transformers

About

Standard transformer architectures apply the same number of layers to every token regardless of contextual difficulty. We present Token-Selective Attention (TSA), a learned per-token gate on residual updates between consecutive transformer blocks. Each gate is a lightweight two-layer multi-layer perceptron (MLP) that produces a continuous halting probability, making the mechanism end-to-end differentiable with 1.7% parameter overhead and no changes to the base architecture. Notably, TSA learns difficulty-proportional routing without any explicit depth pressure: even at $\lambda=0$ (no depth regularisation), the task-loss gradient alone drives the router to skip 20% of token-layer operations. On character-level language modeling, TSA saved 14-23% of token-layer operations (TLOps) across Tiny-Shakespeare and enwik8 at <0.5% quality loss. At matched efficiency, TSA achieved 0.7% lower validation loss than early exit, and the learned routing transfers directly to inference-time sparse execution for real wall-clock speedup.
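The gating mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the author's implementation: the abstract does not give the exact gate equation, so the sketch assumes the gate value g in (0, 1) multiplies the block's residual update (h + g * f(h)), that inference skips tokens whose gate falls below a threshold, and that depth regularisation is λ times the mean gate value. The names `init_gate`, `gate_values`, `gated_block`, `sparse_block`, and `depth_penalty`, the hidden width, and the threshold of 0.5 are all hypothetical. The stand-in block is token-wise, so running it on a subset of tokens is exact; a real attention block would need extra care.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gate(d_model, d_hidden, rng):
    """Parameters of a hypothetical two-layer MLP gate (sizes are assumptions)."""
    return {
        "W1": rng.normal(0.0, 0.02, size=(d_model, d_hidden)),
        "b1": np.zeros(d_hidden),
        "W2": rng.normal(0.0, 0.02, size=(d_hidden, 1)),
        "b2": np.zeros(1),
    }

def gate_values(h, p):
    """Per-token gate in (0, 1); h has shape (seq_len, d_model)."""
    hidden = np.maximum(h @ p["W1"] + p["b1"], 0.0)  # ReLU hidden layer
    return sigmoid(hidden @ p["W2"] + p["b2"])       # shape (seq_len, 1)

def gated_block(h, block_fn, p):
    """Soft, differentiable form (assumed): scale the residual update by the gate."""
    g = gate_values(h, p)
    return h + g * block_fn(h), g

def sparse_block(h, block_fn, p, threshold=0.5):
    """Hard form (assumed): run the block only on tokens with gate >= threshold."""
    g = gate_values(h, p)
    keep = g[:, 0] >= threshold
    out = h.copy()                                   # skipped tokens pass through
    if keep.any():
        out[keep] = h[keep] + g[keep] * block_fn(h[keep])
    return out, keep

def depth_penalty(gates, lam):
    """Assumed regulariser: lam times the mean gate value over layers and tokens."""
    return lam * float(np.mean([g.mean() for g in gates]))

rng = np.random.default_rng(0)
h = rng.normal(size=(8, 16))                         # 8 tokens, width 16
params = init_gate(16, 4, rng)
block = np.tanh                                      # token-wise stand-in for a block

soft_out, g = gated_block(h, block, params)
hard_out, keep = sparse_block(h, block, params, threshold=0.5)
saved_fraction = 1.0 - keep.mean()                   # share of token-layer ops skipped
```

In this reading, the soft form is what gradients flow through during training, and the hard form is what would deliver wall-clock savings at inference, since skipped tokens never enter the block.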

Ahmed Abdelmuniem Abdalla Mohammed • 2026

Related benchmarks

Task                               Dataset                                      Result                   Rank
Character-level Language Modeling  Tiny Shakespeare (val)                       Validation Loss: 1.4482  19
Character-level Language Modeling  Enwik8 (val)                                 BPC: 1.8429              17
Sequence Copying                   Toy Vocabulary 1K sequences (held-out)       Accuracy: 100            2
Sequence Sorting                   Toy Vocabulary Sort 1K sequences (held-out)  Accuracy: 98.78          2