Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence

About

Topic models extract groups of words from documents, whose interpretation as a topic hopefully allows for a better understanding of the data. However, the resulting word groups are often not coherent, making them harder to interpret. Recently, neural topic models have shown improvements in overall coherence. Concurrently, contextual embeddings have advanced the state of the art of neural models in general. In this paper, we combine contextualized representations with neural topic models. We find that our approach produces more meaningful and coherent topics than traditional bag-of-words topic models and recent neural models. Our results indicate that future improvements in language models will translate into better topic models.

Federico Bianchi, Silvia Terragni, Dirk Hovy• 2020

Related benchmarks

Task	Dataset	Result
Text Classification	Newsgroup Religion (5-fold cross-validation)	Accuracy55.7	36
Text Classification	SMS Spam Collection (5-fold cross-validation)	Accuracy98.7	36
Text Classification	Newsgroup Science (5-fold cross-validation)	Accuracy0.702	36
Text Classification	Drug Review Norethindrone (5-fold cross-validation)	Accuracy58.6	36
Text Classification	Drug Review Norgestimate (5-fold cross-validation)	Accuracy62.8	36
Text Classification	Yelp (5-fold cross-validation)	Accuracy68.6	36
Topic Modeling	20NG	NPMI0.107	33
Topic Modeling	Jamuna News	CV0.67	29
Topic Modeling	NCTBText	CV0.64	29
Topic Modeling	BanFakeNews	CV0.71	25

Showing 10 of 48 rows

Other info

Follow for update

@wizwand_team Discord