On the State of the Art of Evaluation in Neural Language Models
About
Ongoing innovations in recurrent neural network architectures have provided a steady influx of apparently state-of-the-art results on language modelling benchmarks. However, these have been evaluated using differing code bases and limited computational resources, which represent uncontrolled sources of experimental variation. We reevaluate several popular architectures and regularisation methods with large-scale automatic black-box hyperparameter tuning and arrive at the somewhat surprising conclusion that standard LSTM architectures, when properly regularised, outperform more recent models. We establish a new state of the art on the Penn Treebank and Wikitext-2 corpora, as well as strong baselines on the Hutter Prize dataset.
Gábor Melis, Chris Dyer, Phil Blunsom • 2017
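The abstract's central tool is large-scale black-box hyperparameter tuning: each candidate configuration is trained and scored on validation perplexity, and only the score is fed back to the tuner. As a minimal, runnable sketch of that loop, here is plain random search with a toy surrogate objective standing in for a full LSTM training run; the configuration space, the `evaluate` surrogate, and all constants are hypothetical, not taken from the paper (which used a more sophisticated Gaussian-process-based tuner).

```python
import random

def evaluate(config):
    """Stand-in for training a model and measuring validation perplexity.

    In the paper this step is a full LSTM training run; here it is a toy
    quadratic surrogate (hypothetical) so the sketch is self-contained.
    """
    lr, dropout = config["learning_rate"], config["dropout"]
    # Minimum of 60.0 at lr=0.003, dropout=0.5 (arbitrary illustrative values).
    return 60.0 + 100 * ((lr - 0.003) / 0.003) ** 2 + 50 * (dropout - 0.5) ** 2

def random_search(n_trials, seed=0):
    """Black-box tuning loop: sample configs, keep the best by validation score."""
    rng = random.Random(seed)
    best_config, best_ppl = None, float("inf")
    for _ in range(n_trials):
        config = {
            "learning_rate": 10 ** rng.uniform(-4, -2),  # log-uniform sampling
            "dropout": rng.uniform(0.0, 0.8),
        }
        ppl = evaluate(config)
        if ppl < best_ppl:
            best_config, best_ppl = config, ppl
    return best_config, best_ppl

best_config, best_ppl = random_search(200)
print(best_config, best_ppl)
```

The key property, which the paper relies on, is that the tuner treats the model as an opaque function from hyperparameters to a validation score, so every architecture is tuned by the same procedure.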
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Language Modeling | WikiText-2 (test) | PPL | 65.9 | 1541 |
| Language Modeling | PTB (test) | Perplexity | 58.3 | 471 |
| Language Modeling | Penn Treebank (test) | Perplexity | 58.3 | 411 |
| Language Modeling | WikiText2 v1 (test) | Perplexity | 65.9 | 341 |
| Language Modeling | WikiText2 (val) | Perplexity (PPL) | 69.1 | 277 |
| Character-level Language Modeling | enwik8 (test) | BPC | 1.626 | 195 |
| Language Modeling | Penn Treebank (val) | Perplexity | 60.9 | 178 |
| Language Modeling | PTB (val) | Perplexity | 60.9 | 83 |
| Language Modeling | Penn Treebank word-level (test) | Perplexity | 58.3 | 72 |
| Character-level Language Modeling | Hutter Prize Wikipedia (test) | Bits/Char | 1.3 | 28 |
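The table mixes two metrics: word-level results report perplexity and character-level results report bits per character (BPC). Both are transforms of the model's average negative log-likelihood on the test set. A minimal sketch of the two conversions, using illustrative inputs chosen to land near the table's values (the NLL numbers themselves are not from the paper):

```python
import math

def perplexity(nll_per_word):
    # Word-level perplexity: exp of the average negative
    # log-likelihood (in nats) per word.
    return math.exp(nll_per_word)

def bits_per_char(nll_per_char):
    # Character-level BPC: average negative log-likelihood per
    # character, converted from nats to bits.
    return nll_per_char / math.log(2)

# Illustrative inputs (hypothetical, chosen to match the table's scale):
print(round(perplexity(4.066), 1))      # → 58.3
print(round(bits_per_char(0.9013), 2))  # → 1.3
```

Lower is better for both metrics; a perplexity of 58.3 means the model is, on average, as uncertain as a uniform choice over about 58 words.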