Transformers with Selective Access to Early Representations
About
Several recent Transformer architectures expose later layers to representations computed in the earliest layers, motivated by the observation that low-level features can become harder to recover as the residual stream is repeatedly transformed through depth. The cheapest among these methods add static value residuals: learned mixing coefficients that expose the first-layer value projection V_1 uniformly across tokens and heads. More expressive dense or dynamic alternatives recover finer-grained access, but at higher memory cost and lower throughput. The usefulness of V_1 is unlikely to be constant across tokens, heads, and contexts; different positions plausibly require different amounts of access to early lexical or semantic information. We therefore treat early-representation reuse as a retrieval problem rather than a connectivity problem, and introduce Selective Access Transformer (SATFormer), which preserves the first-layer value pathway while controlling access with a context-dependent gate. Across models from 130M to 1.3B parameters, SATFormer consistently improves validation loss and zero-shot accuracy over the static value-residual and Transformer baselines. Its strongest gains appear on retrieval-intensive benchmarks, where it improves over static value residuals by approximately 1.5 average points, while maintaining throughput and memory usage close to the baseline Transformer. Gate analyses suggest sparse, depth-dependent, head-specific, and category-sensitive access patterns, supporting the interpretation that SATFormer learns selective reuse of early representations rather than uniform residual copying. Our code is available at https://github.com/SkyeGunasekaran/SATFormer.
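The contrast above between static value residuals and context-dependent gating can be sketched in a few lines of NumPy. This is an illustration only, under assumptions the abstract does not state: the gate is taken to be a sigmoid of a linear projection of the current hidden state, yielding one scalar per token and head, and all names (`static_value_residual`, `gated_value_access`, `lam_l`, `lam_1`, `w_gate`, `b_gate`) are hypothetical. The actual SATFormer gate may be parameterized differently; the linked repository is the authoritative source.

```python
import numpy as np


def static_value_residual(v_l, v_1, lam_l, lam_1):
    """Static mixing: two scalars shared by every token and every head.

    v_l, v_1: arrays of shape (T, H, d_head) holding the current-layer and
    first-layer value projections. lam_l and lam_1 stand in for learned
    mixing coefficients (hypothetical names).
    """
    return lam_l * v_l + lam_1 * v_1


def gated_value_access(v_l, v_1, h, w_gate, b_gate):
    """Context-dependent access to V_1 (assumed form, not the paper's).

    h:      (T, d_model) current hidden states
    w_gate: (d_model, H) gate projection, one output per head
    b_gate: (H,)         gate bias
    Returns the mixed values and the gate of shape (T, H).
    """
    logits = h @ w_gate + b_gate            # (T, H)
    gate = 1.0 / (1.0 + np.exp(-logits))    # sigmoid, values in (0, 1)
    # Broadcast the per-token, per-head gate over the head dimension.
    mixed = v_l + gate[..., None] * v_1
    return mixed, gate


# Toy shapes: 4 tokens, 2 heads, head width 8, model width 16.
rng = np.random.default_rng(0)
T, H, d_head, d_model = 4, 2, 8, 16
v_l = rng.normal(size=(T, H, d_head))
v_1 = rng.normal(size=(T, H, d_head))
h = rng.normal(size=(T, d_model))
w_gate = rng.normal(size=(d_model, H)) * 0.1
b_gate = np.zeros(H)

static_out = static_value_residual(v_l, v_1, lam_l=1.0, lam_1=0.5)
gated_out, gate = gated_value_access(v_l, v_1, h, w_gate, b_gate)
```

In this sketch the static variant applies the same `lam_1` everywhere, whereas the gated variant lets each token and head choose how much of V_1 to read. A strongly negative `b_gate` drives the gate toward zero, which would correspond to the sparse access pattern the gate analyses describe.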
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | HellaSwag Accuracy | 51.01 | 711 |
| Question Answering | ARC Challenge | Accuracy (ARC) | 32.59 | 598 |
| Language Modeling | LAMBADA | Accuracy | 43.56 | 412 |
| Language Modeling | WikiText-103 | PPL | 18.47 | 216 |
| Question Answering | ARC Easy | Accuracy | 65.69 | 210 |
| Question Answering | BoolQ | Accuracy | 61.44 | 201 |
| Commonsense Reasoning | SocialIQA | Accuracy | 40.63 | 158 |
| Structured Web Data Extraction | SWDE | Performance | 32.4 | 126 |
| Language Modeling | Pre-training (val) | Validation Loss | 2.139 | 55 |
| Question Answering | SQuAD | Score | 36.96 | 35 |