Distilling an Ensemble of Greedy Dependency Parsers into One MST Parser
About
We introduce two first-order graph-based dependency parsers achieving a new state of the art. The first is a consensus parser built from an ensemble of independently trained greedy LSTM transition-based parsers with different random initializations. We cast this approach as minimum Bayes risk decoding (under the Hamming cost) and argue that weaker consensus within the ensemble is a useful signal of difficulty or ambiguity. The second parser is a "distillation" of the ensemble into a single model. We train the distillation parser using a structured hinge loss objective with a novel cost that incorporates ensemble uncertainty estimates for each possible attachment, thereby avoiding the intractable cross-entropy computations required when applying standard distillation objectives to problems with structured outputs. The first-order distillation parser matches or surpasses the state of the art on English, Chinese, and German.
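Under the Hamming cost (which counts incorrect attachments), minimum Bayes risk decoding over the ensemble reduces to scoring each candidate head by how often the ensemble members propose it. The sketch below illustrates this vote-counting view; the function name and the per-token argmax are illustrative assumptions, not the paper's implementation, and a full consensus parser would run an MST decoder (e.g. Chu-Liu/Edmonds) over the vote scores to enforce the tree constraint.

```python
import numpy as np

def mbr_consensus_heads(ensemble_heads):
    """Illustrative MBR decoding under the Hamming cost.

    ensemble_heads: one head array per parser in the ensemble;
    heads[i] is the predicted head of token i (0 denotes the root).
    Returns the most-voted head per token and the ensemble's vote
    share for that head. Low vote share is exactly the "weak
    consensus" signal of difficulty or ambiguity described above.
    Note: a real consensus parser would decode an MST over the vote
    scores rather than take independent per-token argmaxes.
    """
    ensemble = np.asarray(ensemble_heads)        # (n_parsers, n_tokens)
    n_parsers, n_tokens = ensemble.shape
    votes = np.zeros((n_tokens, n_tokens + 1), dtype=int)
    for heads in ensemble:
        for tok, head in enumerate(heads):
            votes[tok, head] += 1               # one vote per parser
    consensus = votes.argmax(axis=1)             # most-voted head per token
    agreement = votes.max(axis=1) / n_parsers    # vote share in [0, 1]
    return consensus, agreement
```

For example, with three ensemble members that agree on every attachment except one token, the disagreeing token gets a vote share of 2/3, flagging it as a comparatively uncertain attachment.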
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Dependency Parsing | Chinese Treebank (CTB) (test) | UAS | 89.8 | 99 |
| Dependency Parsing | Penn Treebank (PTB) (test) | LAS | 94.6 | 80 |
| Dependency Parsing | English PTB Stanford Dependencies (test) | UAS | 94.51 | 76 |
| Dependency Parsing | CoNLL German 2009 (test) | UAS | 91.6 | 25 |
| Dependency Parsing | Penn Treebank (PTB) Section 23 v2.2 (test) | UAS | 94.26 | 17 |
| POS Tagging | Penn Treebank (PTB) Section 23 v2.2 (test) | POS Accuracy | 97.3 | 15 |
| Dependency Parsing | CoNLL 2009 (test) | UAS | 91.86 | 14 |
| Dependency Parsing | Penn Treebank Stanford Dependencies (PTB-SD) 3.3.0 (23) | UAS | 94.51 | 6 |