
Multi-matrix Factorization Attention

About

We propose novel attention architectures, Multi-matrix Factorization Attention (MFA) and MFA-Key-Reuse (MFA-KR). Existing variants of standard Multi-Head Attention (MHA), including SOTA methods like MLA, fail to maintain equally strong performance under stringent Key-Value cache (KV cache) constraints. MFA enhances model capacity by efficiently scaling up both the number and dimension of attention heads through low-rank matrix factorization in the Query-Key (QK) circuit. Extending MFA, MFA-KR further reduces memory requirements by repurposing the key cache as the value through value projection re-parameterization. MFA's design delivers strong model capacity under a tight KV cache budget, while MFA-KR suits even harsher KV cache limits with a minor performance trade-off. Notably, in our extensive large-scale experiments, the proposed architecture outperforms MLA and performs comparably to MHA, while reducing KV cache usage by up to 56% and 93.7%, respectively.
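
To make the two designs more concrete, here is a minimal PyTorch sketch, not the authors' implementation. It assumes that a single shared key (and value) head of dimension head_dim is what gets cached, that the many query heads are produced through a low-rank factorization (a shared down-projection to a rank-r space followed by an up-projection to all heads), and that in the key-reuse variant the value is re-parameterized from the cached key. All names (MFASketch, q_down, q_up, v_reparam) and hyper-parameters are illustrative, not taken from the paper.

```python
# Minimal sketch of the ideas described in the abstract (hedged; hypothetical names).
import math
import torch
import torch.nn as nn


class MFASketch(nn.Module):
    def __init__(self, d_model=1024, n_heads=32, head_dim=128, rank=256, key_reuse=False):
        super().__init__()
        self.n_heads, self.head_dim, self.key_reuse = n_heads, head_dim, key_reuse
        # Low-rank factorized query projection: d_model -> rank -> n_heads * head_dim.
        self.q_down = nn.Linear(d_model, rank, bias=False)
        self.q_up = nn.Linear(rank, n_heads * head_dim, bias=False)
        # A single key (and value) head shared by all query heads keeps the KV cache small.
        self.k_proj = nn.Linear(d_model, head_dim, bias=False)
        if key_reuse:
            # MFA-KR-style key reuse: the value is re-parameterized from the key,
            # so only the key would need to be cached at inference time.
            self.v_reparam = nn.Linear(head_dim, head_dim, bias=False)
        else:
            self.v_proj = nn.Linear(d_model, head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * head_dim, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        # Queries for all heads via the low-rank factorization.
        q = self.q_up(self.q_down(x)).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x)  # (b, t, head_dim): the shared key, i.e. the cached tensor
        v = self.v_reparam(k) if self.key_reuse else self.v_proj(x)
        k = k.unsqueeze(1)  # broadcast the single KV head across all query heads
        v = v.unsqueeze(1)
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        attn = attn.masked_fill(causal, float("-inf")).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.o_proj(out)


x = torch.randn(2, 16, 1024)
print(MFASketch(key_reuse=True)(x).shape)  # torch.Size([2, 16, 1024])
```

In this sketch the per-token cache is one head_dim-sized key plus one head_dim-sized value (or the key alone when key_reuse=True), compared with n_heads * head_dim keys and values for standard MHA, which is where the cache savings in the sketch come from; the paper's exact projections and hyper-parameters may differ.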

Jingcheng Hu, Houyi Li, Yinmin Zhang, Zili Wang, Shuigeng Zhou, Xiangyu Zhang, Heung-Yeung Shum, Daxin Jiang• 2024

Related benchmarks

Task                   Dataset          Metric                   Result   Rank
Commonsense Reasoning  HellaSwag        --                       --       1891
Language Modeling      C4 (val)         PPL                      16.738   514
Commonsense Reasoning  WinoGrande       Accuracy                 58.96    372
Commonsense Reasoning  BoolQ            Accuracy                 63.49    212
Commonsense Reasoning  ARC-C            --                       --       172
Language Modeling      FineWeb (val)    --                       --       159
Commonsense Reasoning  ARC-E            Accuracy                 69.02    106
Commonsense Reasoning  PIQA             Accuracy                 75.19    71
Commonsense Reasoning  OpenBookQA       Accuracy                 42.4     71
Language Modeling      The Pile (val)   Perplexity (bits/byte)   13.903   31

Showing 10 of 15 rows.
