Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better than Dot-Product Self-Attention

About

Self-Attention is a widely used building block in neural modeling to mix long-range data elements. Most self-attention neural networks employ pairwise dot-products to specify the attention coefficients. However, these methods require $O(N^2)$ computing cost for sequence length $N$. Even though some approximation methods have been introduced to relieve the quadratic cost, the performance of the dot-product approach is still bottlenecked by the low-rank constraint in the attention matrix factorization. In this paper, we propose a novel scalable and effective mixing building block called Paramixer. Our method factorizes the interaction matrix into several sparse matrices, where we parameterize the non-zero entries by MLPs with the data elements as input. The overall computing cost of the new building block is as low as $O(N \log N)$. Moreover, all factorizing matrices in Paramixer are full-rank, so it does not suffer from the low-rank bottleneck. We have tested the new method on both synthetic and various real-world long sequential data sets and compared it with several state-of-the-art attention networks. The experimental results show that Paramixer has better performance in most learning tasks.
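
The factorization described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a simple circular dilated sparsity pattern, where factor k has two non-zeros per row: the diagonal and the entry at offset 2^k. It also assumes a one-hidden-layer MLP that maps each data element to those two entries. The function names (`mlp_weights`, `sparse_mix`, `dense_factor`) and the parameter layout are hypothetical, and the paper's actual sparsity protocol and MLP architecture may differ.

```python
import numpy as np

def mlp_weights(X, W1, b1, W2, b2):
    """Hypothetical tiny MLP: maps each row x_i to the 2 non-zero entries of row i."""
    H = np.maximum(X @ W1 + b1, 0.0)  # ReLU hidden layer, shape (N, h)
    return H @ W2 + b2                # shape (N, 2)

def sparse_mix(X, V, params):
    """Apply log2(N) sparse factors to V without ever forming an N x N matrix.

    Assumed pattern: factor k has non-zeros at (i, i) and (i, (i + 2**k) % N).
    Each factor costs O(N d), so the whole product costs O(N d log N).
    """
    N = X.shape[0]
    levels = N.bit_length() - 1
    assert 2 ** levels == N, "this sketch assumes N is a power of two"
    assert len(params) == levels
    for k, p in enumerate(params):
        w = mlp_weights(X, *p)                   # data-dependent non-zero entries
        shifted = np.roll(V, -(2 ** k), axis=0)  # row i holds V[(i + 2**k) % N]
        V = w[:, :1] * V + w[:, 1:] * shifted
    return V

def dense_factor(w, k):
    """Dense N x N version of factor k, used only to check the sparse routine."""
    N = w.shape[0]
    idx = np.arange(N)
    F = np.zeros((N, N))
    F[idx, idx] = w[:, 0]
    F[idx, (idx + 2 ** k) % N] = w[:, 1]
    return F
```

With all non-zero entries set to 1, the product of the log2(N) factors in this sketch is the all-ones matrix, so every position mixes with every other position even though each factor holds only 2N non-zeros.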

Tong Yu, Ruslan Khalitov, Lei Cheng, Zhirong Yang • 2022

Related benchmarks

Task	Dataset	Result
Long sequence classification	LRA (Long Range Arena) (test)	Average Accuracy 83.32	92
Long-sequence modeling	Long Range Arena (LRA) v1 (test)	ListOps 39.71	66
Classification	LRA ListOps N=2000 (test)	Accuracy 39.57	39
Classification	LRA Pathfinder N=1024 (test)	Accuracy 80.49	23
Classification	LRA Image N=1024 (test)	Accuracy 46.58	23
Long Document Classification	Long-document-dataset (test)	Accuracy 84.55	14
DNA Sequence-based Taxonomy Classification	Ensembl (B/S) 3 (test)	Accuracy 66.77	9
Long Document Classification	LongDoc16K (test)	Accuracy 79.6	9
Long Document Classification	LongDoc 32K (test)	Accuracy 74.76	9
DNA Sequence-based Taxonomy Classification	Ensembl (M/R) 3 (test)	Accuracy 56.37	9

