Regularizing and Optimizing LSTM Language Models
About
Recurrent neural networks (RNNs), such as long short-term memory networks (LSTMs), serve as a fundamental building block for many sequence learning tasks, including machine translation, language modeling, and question answering. In this paper, we consider the specific problem of word-level language modeling and investigate strategies for regularizing and optimizing LSTM-based models. We propose the weight-dropped LSTM, which uses DropConnect on hidden-to-hidden weights as a form of recurrent regularization. Further, we introduce NT-ASGD, a variant of the averaged stochastic gradient method, wherein the averaging trigger is determined using a non-monotonic condition as opposed to being tuned by the user. Using these and other regularization strategies, we achieve state-of-the-art word-level perplexities on two data sets: 57.3 on Penn Treebank and 65.8 on WikiText-2. In exploring the effectiveness of a neural cache in conjunction with our proposed model, we achieve an even lower state-of-the-art perplexity of 52.8 on Penn Treebank and 52.0 on WikiText-2.
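The two core ideas above can be sketched compactly. Below is a minimal NumPy illustration, not the authors' PyTorch implementation: `weight_drop` applies DropConnect to a hidden-to-hidden weight matrix (each recurrent weight is zeroed independently with probability `p`, with inverted-dropout scaling), and `nt_asgd_trigger` shows one reading of the non-monotonic condition (start averaging once the latest validation loss is no better than the best seen more than `n` checks ago). The function names and the exact form of the trigger are our assumptions; consult the paper's Algorithm 1 for the precise condition.

```python
import numpy as np

def weight_drop(w_hh, p, rng):
    """DropConnect on recurrent weights: zero each entry of the
    hidden-to-hidden matrix independently with probability p.
    In the weight-dropped LSTM this mask is sampled once per
    forward pass, so the same mask spans all time steps."""
    mask = rng.random(w_hh.shape) >= p
    return w_hh * mask / (1.0 - p)  # inverted-dropout scaling

def nt_asgd_trigger(val_losses, n=5):
    """Non-monotonic trigger for NT-ASGD (one plausible reading):
    begin averaging once the latest validation loss fails to beat
    the best loss observed more than n evaluation cycles ago."""
    t = len(val_losses)
    return t > n and val_losses[-1] > min(val_losses[: t - n])
```

While validation loss keeps improving, `nt_asgd_trigger` stays `False` and plain SGD continues; once it returns `True`, the optimizer switches to averaging iterates, removing the need for a hand-tuned trigger point.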
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Language Modeling | WikiText-2 (test) | PPL | 44.3 | 1541 |
| Language Modeling | PTB (test) | Perplexity | 57.3 | 471 |
| Language Modeling | Penn Treebank (test) | Perplexity | 51.1 | 411 |
| Language Modeling | WikiText2 v1 (test) | Perplexity | 52 | 341 |
| Language Modeling | WikiText2 (val) | Perplexity (PPL) | 46.4 | 277 |
| Character-level Language Modeling | enwik8 (test) | BPC | 1.232 | 195 |
| Language Modeling | Penn Treebank (val) | Perplexity | 51.6 | 178 |
| Language Modeling | Penn Treebank (PTB) (test) | Perplexity | 51.1 | 120 |
| Language Modeling | PTB (val) | Perplexity | 60 | 83 |
| Language Modeling | Penn Treebank word-level (test) | Perplexity | 51.1 | 72 |