Attention Projection Mixing with Exogenous Anchors
About
Cross-layer reuse of early attention projections can improve optimization and data efficiency, but it creates a structural conflict: the first layer must simultaneously act as a stable, reusable anchor for all deeper layers and as an effective computational block. We demonstrate that this tension constrains the performance of internal-anchor designs. We propose ExoFormer, which resolves the conflict by learning exogenous anchor projections outside the sequential layer stack. We introduce a unified normalized mixing framework that mixes queries, keys, values, and gate logits using learnable coefficients (exploring three coefficient granularities: elementwise, headwise, and scalar), and we show that normalizing anchor sources is key to stable reuse. ExoFormer variants consistently outperform their internal-anchor counterparts, and the dynamic variant gains +1.5 downstream accuracy points while matching the validation loss of Gated Attention using 1.5x fewer tokens. We explain this efficacy via an Offloading Hypothesis: external anchors preserve essential token identity, allowing layers to specialize exclusively in feature transformation. We release code and models to facilitate future research.
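The normalized mixing described above can be sketched as follows. This is a minimal illustrative sketch, not the released implementation: the function names (`rms_norm`, `mix_projection`), the use of RMS normalization for the anchor source, the sigmoid parameterization of the mixing coefficient, and the tensor shapes are all assumptions made for illustration; only the ideas of normalizing the anchor and mixing with learnable coefficients at scalar, headwise, or elementwise granularity come from the abstract.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Normalize the anchor source along the feature dimension.
    # (RMS normalization is an assumed choice; the abstract only says
    # that normalizing anchor sources is key to stable reuse.)
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def mix_projection(layer_proj, anchor_proj, alpha):
    """Mix a layer's own projection with a normalized exogenous anchor.

    layer_proj, anchor_proj: arrays of shape (heads, tokens, head_dim).
    alpha: learnable mixing logits. Its shape sets the granularity:
      ()                  -> scalar coefficient
      (heads, 1, 1)       -> headwise coefficients
      (heads, 1, head_dim)-> elementwise coefficients
    """
    a = 1.0 / (1.0 + np.exp(-alpha))  # sigmoid keeps the mix weight in [0, 1]
    return a * layer_proj + (1.0 - a) * rms_norm(anchor_proj)

# Toy usage: 2 heads, 4 tokens, 8-dim heads, headwise coefficients.
rng = np.random.default_rng(0)
q_layer = rng.normal(size=(2, 4, 8))   # this layer's own query projection
q_anchor = rng.normal(size=(2, 4, 8))  # exogenous anchor query projection
alpha_headwise = np.zeros((2, 1, 1))   # zero logits -> an equal 0.5/0.5 mix
q_mixed = mix_projection(q_layer, q_anchor, alpha_headwise)
print(q_mixed.shape)  # (2, 4, 8)
```

The same mixing rule would apply uniformly to queries, keys, values, and gate logits, with broadcasting handling all three coefficient granularities through a single code path.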
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Physical Interaction Question Answering | PIQA | Accuracy | 68.77 | 323 |
| Multiple-choice Question Answering | ARC Easy | Accuracy | 66.08 | 122 |
| Multiple-choice Question Answering | ARC Challenge | Accuracy | 34.73 | 106 |
| Question Answering | WinoGrande (WG) | Accuracy | 55.64 | 98 |
| Multiple-choice Question Answering | HellaSwag | Accuracy | 44.54 | 59 |
| Language Modeling | (val) | Perplexity | 14.09 | 30 |
| Multiple-choice Question Answering | OpenBookQA | Accuracy | 36.4 | 18 |