Multi-matrix Factorization Attention
About
We propose two novel attention architectures, Multi-matrix Factorization Attention (MFA) and MFA-Key-Reuse (MFA-KR). Existing variants of standard Multi-Head Attention (MHA), including state-of-the-art methods such as MLA, struggle to maintain strong performance under stringent Key-Value cache (KV cache) constraints. MFA enhances model capacity by efficiently scaling up both the number and the dimension of attention heads through low-rank matrix factorization in the Query-Key (QK) circuit. Extending MFA, MFA-KR further reduces memory requirements by repurposing the key cache as the value cache through a re-parameterization of the value projection. MFA's design provides strong model capacity under a tight KV cache budget, while MFA-KR suits even harsher KV cache limits at the cost of a minor performance trade-off. Notably, in our extensive, large-scale experiments, the proposed architecture outperforms MLA and performs comparably to MHA, while reducing KV cache usage by up to 56% relative to MLA and 93.7% relative to MHA.
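To make the KV cache budget concrete, the sketch below counts the bytes cached per token for three layouts. It is an illustration under stated assumptions, not the authors' implementation. The layer count, head counts, and head dimensions are hypothetical and are not the configurations used in the paper. The "shared" layout assumes a single key head and a single value head shared by all query heads, and the "key only" layout assumes that only keys are cached, with values derived from them. The abstract describes the key-reuse idea but does not specify these details.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim,
                             cache_values=True, bytes_per_elem=2):
    """Bytes of KV cache stored per token, summed over all layers.

    cache_values=False models a layout that stores keys only and
    derives values from the cached keys (an assumption made here
    for illustration).
    """
    tensors = 2 if cache_values else 1  # keys and values, or keys only
    return n_layers * n_kv_heads * head_dim * tensors * bytes_per_elem


def reduction(baseline, variant):
    """Fraction of the baseline cache saved by the variant."""
    return 1.0 - variant / baseline


# Hypothetical configuration, chosen only to keep the arithmetic easy to follow.
N_LAYERS = 12

# Standard multi-head layout: every head caches its own keys and values.
mha = kv_cache_bytes_per_token(N_LAYERS, n_kv_heads=8, head_dim=64)

# Assumed shared layout: one wider key head and one wider value head.
shared = kv_cache_bytes_per_token(N_LAYERS, n_kv_heads=1, head_dim=128)

# Assumed key-only layout: the value cache is dropped entirely.
key_only = kv_cache_bytes_per_token(N_LAYERS, n_kv_heads=1, head_dim=128,
                                    cache_values=False)

if __name__ == "__main__":
    print(f"multi-head: {mha} bytes per token")
    print(f"shared:     {shared} bytes per token "
          f"({reduction(mha, shared):.1%} smaller)")
    print(f"key only:   {key_only} bytes per token "
          f"({reduction(mha, key_only):.1%} smaller)")
```

Because the cache grows linearly with sequence length and batch size, the per-token figure is the quantity to compare: dropping the value cache halves whatever the shared layout stores, independent of the other settings.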
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Commonsense Reasoning | HellaSwag | -- | 1896 |
| Language Modeling | C4 (val) | PPL 16.738 | 737 |
| Commonsense Reasoning | WinoGrande | Accuracy 58.96 | 453 |
| Common Sense Reasoning | BoolQ | Accuracy 63.49 | 240 |
| Language Modeling | FineWeb (val) | -- | 217 |
| Commonsense Reasoning | ARC-C | -- | 215 |
| Commonsense Reasoning | ARC-E | Accuracy 69.02 | 152 |
| Commonsense Reasoning | OpenBookQA | Accuracy 42.4 | 108 |
| Common Sense Reasoning | PIQA | Accuracy 75.19 | 100 |
| Language Modeling | FineWeb-Edu (val) | Perplexity 9.506 | 51 |