Attention Projection Mixing with Exogenous Anchors
About
Cross-layer reuse of early attention projections can improve optimization and data efficiency, but it creates a structural conflict: the first layer must simultaneously act as a stable, reusable anchor for all deeper layers and as an effective computational block. We demonstrate that this tension constrains the performance of internal-anchor designs. We propose ExoFormer, which resolves the conflict by learning exogenous anchor projections outside the sequential layer stack. We introduce a unified normalized mixing framework that mixes queries, keys, values, and gate logits using learnable coefficients (exploring three coefficient granularities: elementwise, headwise, and scalar), and we show that normalizing anchor sources is key to stable reuse. ExoFormer variants consistently outperform their internal-anchor counterparts, and the dynamic variant gains +1.5 downstream accuracy points while matching the validation loss of Gated Attention using 1.5x fewer tokens. We explain this efficacy via an Offloading Hypothesis: external anchors preserve essential token identity, allowing layers to specialize exclusively in feature transformation. We release code and models to facilitate future research.
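The normalized mixing described above can be sketched as follows. This is a minimal illustrative sketch, not the released implementation: the function names (`rms_norm`, `mix_projection`), the use of RMS normalization for the anchor source, the sigmoid parameterization of the mixing coefficient, and the tensor shapes are all assumptions made for illustration; only the ideas of normalizing the anchor and mixing with learnable coefficients at scalar, headwise, or elementwise granularity come from the abstract.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Normalize the anchor source along the feature dimension.
    # (RMS normalization is an assumed choice; the abstract only says
    # that normalizing anchor sources is key to stable reuse.)
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def mix_projection(layer_proj, anchor_proj, alpha):
    """Mix a layer's own projection with a normalized exogenous anchor.

    layer_proj, anchor_proj: arrays of shape (heads, tokens, head_dim).
    alpha: learnable mixing logits. Its shape sets the granularity:
      ()                  -> scalar coefficient
      (heads, 1, 1)       -> headwise coefficients
      (heads, 1, head_dim)-> elementwise coefficients
    """
    a = 1.0 / (1.0 + np.exp(-alpha))  # sigmoid keeps the mix weight in [0, 1]
    return a * layer_proj + (1.0 - a) * rms_norm(anchor_proj)

# Toy usage: 2 heads, 4 tokens, 8-dim heads, headwise coefficients.
rng = np.random.default_rng(0)
q_layer = rng.normal(size=(2, 4, 8))   # this layer's own query projection
q_anchor = rng.normal(size=(2, 4, 8))  # exogenous anchor query projection
alpha_headwise = np.zeros((2, 1, 1))   # zero logits -> an equal 0.5/0.5 mix
q_mixed = mix_projection(q_layer, q_anchor, alpha_headwise)
print(q_mixed.shape)  # (2, 4, 8)
```

The same mixing rule would apply uniformly to queries, keys, values, and gate logits, with broadcasting handling all three coefficient granularities through a single code path.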
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Physical Interaction Question Answering | PIQA | Accuracy | 68.77 | 323 |
| Multiple-choice Question Answering | ARC Easy | Accuracy | 66.08 | 122 |
| Multiple-choice Question Answering | ARC Challenge | Accuracy | 34.73 | 106 |
| Question Answering | WinoGrande (WG) | Accuracy | 55.64 | 98 |
| Multiple-choice Question Answering | HellaSwag | Accuracy | 44.54 | 59 |
| Language Modeling | (val) | Perplexity | 14.09 | 30 |
| Multiple-choice Question Answering | OpenBookQA | Accuracy | 36.4 | 18 |