H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for Sequences

About

We describe an efficient hierarchical method to compute attention in the Transformer architecture. The proposed attention mechanism exploits a matrix structure similar to the Hierarchical Matrix (H-Matrix) developed by the numerical analysis community, and has linear run time and memory complexity. We perform extensive experiments showing that the inductive bias embodied by our hierarchical attention is effective in capturing the hierarchical structure typical of the sequences in natural language and vision tasks. Our method outperforms alternative sub-quadratic proposals by more than 6 points on average on the Long Range Arena benchmark. It also sets a new SOTA test perplexity on the One Billion Word benchmark with 5x fewer model parameters than the previous best Transformer-based models.

Zhenhai Zhu, Radu Soricut • 2021
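The page carries no code, so the following is a minimal NumPy sketch of the idea described in the abstract: compute near-diagonal attention blocks exactly, and approximate far-away blocks on a coarser, average-pooled copy of the sequence, mimicking the low-rank off-diagonal blocks of an H-Matrix. The function name `h_attention`, the two-level structure, and the `block * exp(...)` weighting are illustrative assumptions, not the authors' exact algorithm, which recurses over multiple coarsening levels.

```python
import numpy as np

def h_attention(q, k, v, block=16):
    """Two-level sketch of hierarchical attention (illustrative only).

    Diagonal blocks of the attention matrix are computed exactly at the
    fine level; all off-diagonal blocks are approximated at one coarse
    level via average-pooled keys/values.
    """
    n, d = q.shape
    nb = n // block                  # number of blocks; assumes n % block == 0
    scale = 1.0 / np.sqrt(d)

    qb = q.reshape(nb, block, d)
    kb = k.reshape(nb, block, d)
    vb = v.reshape(nb, block, d)

    # Fine level: exact attention logits inside each diagonal block.
    local_exp = np.exp(np.einsum('bqd,bkd->bqk', qb, kb) * scale)

    # Coarse level: one averaged key/value per block.
    kc = kb.mean(axis=1)                                  # (nb, d)
    vc = vb.mean(axis=1)                                  # (nb, d)
    coarse = np.einsum('bqd,cd->bqc', qb, kc) * scale     # (nb, block, nb)

    # A coarse entry stands in for `block` tokens, so weight it by `block`.
    coarse_exp = block * np.exp(coarse)

    # Zero out each query's own block: it is already handled exactly above.
    idx = np.arange(nb)
    coarse_exp[idx, :, idx] = 0.0

    # Shared softmax normalization across both levels
    # (max-subtraction stabilization omitted for clarity).
    num = (np.einsum('bqk,bkd->bqd', local_exp, vb)
           + np.einsum('bqc,cd->bqd', coarse_exp, vc))
    den = local_exp.sum(-1) + coarse_exp.sum(-1)          # (nb, block)
    return (num / den[..., None]).reshape(n, d)

# Example: 64 tokens, head dimension 8.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((64, 8)) for _ in range(3))
out = h_attention(q, k, v, block=16)
print(out.shape)  # (64, 8)
```

Note that a single coarse level as above still costs O(n^2/block) for the coarse scores; the paper applies the coarsening recursively over O(log n) levels, which is what yields the linear run time and memory complexity claimed in the abstract.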

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Long-range sequence modeling | Long Range Arena (LRA) | Text Accuracy | 78.69 | 164 |
| Language Modeling | One Billion Word Benchmark (test) | Test Perplexity | 20.25 | 108 |
| Long-sequence modeling | Long Range Arena (LRA) v1 (test) | ListOps | 49.53 | 66 |
| Hierarchical Reasoning | ListOps Long Range Arena (test) | Accuracy | 49.53 | 26 |
| Sequence Modeling | Long Range Arena (val) | ListOps Accuracy | 49.53 | 26 |
| Long-range sequence modeling | LRA 92 (test) | ListOps Accuracy | 49.53 | 26 |
| Hierarchical reasoning on symbolic sequences | Long ListOps (test) | Accuracy | 49.53 | 22 |
