Grammar as a Foreign Language
About
Syntactic constituency parsing is a fundamental problem in natural language processing that has been the subject of intensive research and engineering for decades. As a result, the most accurate parsers are domain-specific, complex, and inefficient. In this paper we show that a domain-agnostic, attention-enhanced sequence-to-sequence model achieves state-of-the-art results on the most widely used syntactic constituency parsing dataset when trained on a large synthetic corpus that was annotated using existing parsers. It also matches the performance of standard parsers when trained only on a small human-annotated dataset, which shows that this model is highly data-efficient, in contrast to sequence-to-sequence models without the attention mechanism. Our parser is also fast, processing over a hundred sentences per second with an unoptimized CPU implementation.
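The key to treating parsing as sequence-to-sequence transduction is serializing the parse tree into a flat token sequence that the decoder can emit. A minimal sketch of such a linearization is below; the tuple-based tree encoding, the function name, and the use of an `XX` placeholder for part-of-speech tags are illustrative assumptions, not the paper's exact implementation.

```python
# Hedged sketch: linearize a constituency tree into a bracketed tag
# sequence, the kind of target a sequence-to-sequence parser can emit.
# Trees are (label, children) tuples; a node with no children stands in
# for a preterminal (POS tag), which we collapse to the placeholder "XX".

def linearize(tree):
    """Depth-first traversal emitting "(LABEL ... )LABEL" tokens."""
    label, children = tree
    if not children:                # preterminal: emit placeholder token
        return ["XX"]
    tokens = ["(" + label]
    for child in children:
        tokens.extend(linearize(child))
    tokens.append(")" + label)
    return tokens

# Example: a two-word sentence with an NP subject and a VP predicate.
tree = ("S", [("NP", [("NNP", [])]), ("VP", [("VBZ", [])])])
print(" ".join(linearize(tree)))    # → (S (NP XX )NP (VP XX )VP )S
```

At decode time the model predicts such token sequences directly, and the tree is recovered by matching opening and closing bracket tokens.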
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Phrase-structure parsing | PTB (§23) | F1 | 92.1 | 56 |
| Constituency Parsing | Penn Treebank WSJ (section 23, test) | F1 | 92.8 | 55 |
| Constituency Parsing | WSJ Penn Treebank (test) | F1 | 92.8 | 27 |
| English constituency parsing | Wall Street Journal (WSJ) (Section 23) | F1 | 92.1 | 12 |
| Constituency Parsing | Penn Treebank WSJ section 22 (dev) | F1 | 93.5 | 9 |
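The F1 scores in the table above are bracketing F1: the harmonic mean of precision and recall over labeled constituent spans in the predicted versus gold trees. A minimal sketch of the computation, assuming each tree has already been reduced to a set of `(label, start, end)` spans (the span format and function name are illustrative, not the official evalb tool):

```python
# Hedged sketch of bracketing F1 over labeled constituent spans.
# gold and predicted are sets of (label, start, end) tuples, where
# start/end are word indices delimiting the constituent.

def bracket_f1(gold, predicted):
    """Harmonic mean of span precision and recall; 0.0 if no match."""
    matched = len(gold & predicted)
    if matched == 0:
        return 0.0
    precision = matched / len(predicted)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)

# Example: one VP span disagrees, so 2 of 3 brackets match.
gold = {("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3)}
pred = {("S", 0, 3), ("NP", 0, 1), ("VP", 2, 3)}
print(round(bracket_f1(gold, pred), 4))  # → 0.6667
```

In practice, published numbers come from the standard evalb scorer, which additionally handles punctuation and label-equivalence conventions.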