
Attention Projection Mixing with Exogenous Anchors

About

Cross-layer reuse of early attention projections can improve optimization and data efficiency, but it creates a structural conflict: the first layer must simultaneously act as a stable, reusable anchor for all deeper layers and as an effective computational block. We demonstrate that this tension constrains the performance of internal-anchor designs. We propose ExoFormer, which resolves the conflict by learning exogenous anchor projections outside the sequential layer stack. We introduce a unified normalized mixing framework that mixes queries, keys, values, and gate logits using learnable coefficients (exploring coefficient granularities: elementwise, headwise, and scalar), and we show that normalizing anchor sources is key to stable reuse. ExoFormer variants consistently outperform their internal-anchor counterparts, and the dynamic variant gains 1.5 downstream accuracy points while matching validation loss using 1.5x fewer tokens than Gated Attention. We explain this efficacy via an Offloading Hypothesis: external anchors preserve essential token identity, allowing layers to specialize exclusively in feature transformation. We release code and models to facilitate future research.
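The normalized mixing described above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the function and variable names are hypothetical, RMS normalization is an assumed choice of norm for the anchor and layer sources, and the fixed coefficient stands in for the learnable mixing parameter. It shows how the same mixing rule supports the three coefficient granularities (scalar, headwise, elementwise) via broadcasting.

```python
import numpy as np

def rms_normalize(x, eps=1e-6):
    # Normalize each vector along the feature dimension to unit RMS.
    # (Assumed normalization; the paper's exact norm may differ.)
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def mix_projection(layer_proj, anchor_proj, alpha):
    """Mix a layer's own projection with an exogenous anchor projection.

    Both sources are normalized before mixing. `alpha` may be a scalar,
    a per-head array (broadcast over sequence and feature dims), or an
    elementwise array matching the projection shape -- the three
    coefficient granularities mentioned in the abstract.
    """
    a = np.asarray(alpha)
    return (1.0 - a) * rms_normalize(layer_proj) + a * rms_normalize(anchor_proj)

# Toy shapes: batch=2, heads=4, seq=3, head_dim=8
rng = np.random.default_rng(0)
q_layer = rng.standard_normal((2, 4, 3, 8))
q_anchor = rng.standard_normal((2, 4, 3, 8))

q_scalar = mix_projection(q_layer, q_anchor, 0.3)                        # scalar
q_headwise = mix_projection(q_layer, q_anchor, np.full((4, 1, 1), 0.3))  # headwise
q_elemwise = mix_projection(q_layer, q_anchor, np.full((2, 4, 3, 8), 0.3))  # elementwise
```

In an actual model the coefficients would be learnable parameters (and, for the dynamic variant, input-dependent), and the same rule would be applied to queries, keys, values, and gate logits.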

Jonathan Su • 2026

Related benchmarks

Task | Dataset | Result | Rank
Physical Interaction Question Answering | PIQA | Accuracy 68.77 | 323
Multiple-choice Question Answering | ARC Easy | Accuracy 66.08 | 122
Multiple-choice Question Answering | ARC Challenge | Accuracy 34.73 | 106
Question Answering | WinoGrande (WG) | Accuracy 55.64 | 98
Multiple-choice Question Answering | HellaSwag | Accuracy 44.54 | 59
Language Modeling | (val) | Perplexity 14.09 | 30
Multiple-choice Question Answering | OpenBookQA | Accuracy 36.4 | 18
