Star-Transformer
About
Although the Transformer has achieved great success on many NLP tasks, its heavy, fully-connected attention structure leads to a dependence on large amounts of training data. In this paper, we present Star-Transformer, a lightweight alternative obtained by careful sparsification. To reduce model complexity, we replace the fully-connected structure with a star-shaped topology, in which every pair of non-adjacent nodes is connected through a shared relay node. Complexity is thus reduced from quadratic to linear, while preserving the capacity to capture both local composition and long-range dependencies. Experiments on four tasks (22 datasets) show that Star-Transformer achieves significant improvements over the standard Transformer on modestly sized datasets.
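The star topology is simple enough to sketch directly. Below is a minimal, single-head NumPy illustration of one update round: each satellite (token) node attends over a small local window plus the shared relay node, and the relay then attends over all satellites, so any two non-adjacent tokens are connected within two hops at O(n) cost per layer. The function name `star_transformer_step`, the window radius `r`, and the omission of learned projections, multi-head attention, and layer normalization are simplifications for illustration, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, K, V):
    # Single-query scaled dot-product attention over a small context.
    scores = K @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ V

def star_transformer_step(H, s, r=1):
    """One round of the star-topology update (single head, no projections).

    H: (n, d) satellite (token) states; s: (d,) relay state.
    Each token attends only to its 2r+1 local neighbours plus the relay,
    so one layer costs O(n) rather than the O(n^2) of full attention.
    """
    n, d = H.shape
    H_new = np.empty_like(H)
    for i in range(n):
        lo, hi = max(0, i - r), min(n, i + r + 1)
        ctx = np.vstack([H[lo:hi], s[None, :]])  # local window + relay node
        H_new[i] = attend(H[i], ctx, ctx)
    # The relay attends to every updated satellite, giving a 2-hop path
    # between any pair of non-adjacent tokens.
    ctx = np.vstack([H_new, s[None, :]])
    s_new = attend(s, ctx, ctx)
    return H_new, s_new

rng = np.random.default_rng(0)
H = rng.standard_normal((10, 16))   # 10 tokens, dim 16
s = H.mean(axis=0)                  # relay initialised as the mean token state
for _ in range(3):                  # a few rounds of message passing
    H, s = star_transformer_step(H, s)
```

Updating satellites first and then letting the relay aggregate the fresh states mirrors the paper's alternating update order; the relay is what carries long-range information, while the ring connections preserve local composition.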
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Natural Language Inference | SNLI (test) | Accuracy | 86 | 681 |
| Named Entity Recognition | CoNLL 2003 (test) | F1 Score | 91.98 | 539 |
| Text Classification | Pubmed | Micro-F1 | 82.35 | 50 |
| POS Tagging | PTB (test) | Accuracy | 97.68 | 24 |
| Text Classification | Reuters | Micro-F1 | 80.22 | 22 |
| Text Classification | AAPD | Micro-F1 | 68.22 | 17 |
| Text Classification | SemEval | Micro-F1 | 51.42 | 17 |
| Text Classification | CAVES | Micro-F1 | 53.86 | 17 |
| Text Classification | SST-1 (test) | Accuracy | 52.9 | 16 |
| Part-of-Speech Tagging | Wall Street Journal (WSJ) section 23 (test) | Accuracy | 97.04 | 12 |