Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence
About
Topic models extract groups of words from documents, whose interpretation as a topic hopefully allows for a better understanding of the data. However, the resulting word groups are often not coherent, making them harder to interpret. Recently, neural topic models have shown improvements in overall coherence. Concurrently, contextual embeddings have advanced the state of the art of neural models in general. In this paper, we combine contextualized representations with neural topic models. We find that our approach produces more meaningful and coherent topics than traditional bag-of-words topic models and recent neural models. Our results indicate that future improvements in language models will translate into better topic models.
Federico Bianchi, Silvia Terragni, Dirk Hovy • 2020
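The abstract describes combining contextualized representations with a neural topic model but does not say how the two are joined. The sketch below assumes simple concatenation of a bag-of-words vector with a document embedding, followed by a single linear layer and a softmax to obtain topic proportions. The random embedding, the helper names (`combine_inputs`, `topic_proportions`), and all dimensions are hypothetical stand-ins, not the authors' implementation; a real model would use a pre-trained encoder and a trained variational inference network.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["space", "nasa", "orbit", "god", "church"]
doc = ["space", "nasa", "space", "orbit"]

# Bag-of-words counts for the document over the fixed vocabulary.
bow = np.array([doc.count(w) for w in vocab], dtype=float)

# ASSUMPTION: a random vector stands in for a contextualized document
# embedding that a pre-trained language model would normally produce.
EMB_DIM = 8
contextual = rng.normal(size=EMB_DIM)


def combine_inputs(bow_vec, contextual_vec):
    """Concatenate both views of the document (assumed combination scheme)."""
    return np.concatenate([bow_vec, contextual_vec])


def topic_proportions(x, weights, bias):
    """Map the combined input to a distribution over topics via softmax."""
    logits = weights @ x + bias
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp / exp.sum()


N_TOPICS = 3
x = combine_inputs(bow, contextual)
# Untrained, randomly initialised weights: illustrative only.
W = rng.normal(scale=0.1, size=(N_TOPICS, x.size))
b = np.zeros(N_TOPICS)
theta = topic_proportions(x, W, b)

print(x.shape, theta.shape)
```

The point of the sketch is only the input construction: the topic model sees both word counts and a context-aware representation of the same document.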
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text Classification | Newsgroup Religion (5-fold cross-validation) | Accuracy | 55.7 | 36 |
| Text Classification | SMS Spam Collection (5-fold cross-validation) | Accuracy | 98.7 | 36 |
| Text Classification | Newsgroup Science (5-fold cross-validation) | Accuracy | 0.702 | 36 |
| Text Classification | Drug Review Norethindrone (5-fold cross-validation) | Accuracy | 58.6 | 36 |
| Text Classification | Drug Review Norgestimate (5-fold cross-validation) | Accuracy | 62.8 | 36 |
| Text Classification | Yelp (5-fold cross-validation) | Accuracy | 68.6 | 36 |
| Topic Modeling | 20NG | NPMI | 0.107 | 23 |
| Document Clustering | Newsgroup Science | Purity | 73 | 18 |
| Document Clustering | Drug Review Norethindrone | Purity | 60 | 18 |
| Topic Modeling | Newsgroup Science | Cv | 0.476 | 18 |
Showing 10 of 44 rows
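The table reports topic coherence as NPMI (normalized pointwise mutual information). As a rough illustration of what that score measures, the sketch below averages NPMI over all pairs of a topic's top words, estimating probabilities from document-level co-occurrence. The function name `npmi_coherence`, the toy corpus, and the edge-case conventions (−1 for pairs that never co-occur, 1 for pairs that co-occur in every document) are assumptions for illustration; the page does not state the exact evaluation setup behind the reported numbers.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, documents):
    """Average NPMI over all pairs of top words (document co-occurrence)."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0.0:
            scores.append(-1.0)  # assumed convention: pair never co-occurs
        elif p12 == 1.0:
            scores.append(1.0)   # assumed convention: avoids dividing by -log(1) = 0
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Toy corpus (hypothetical): two "space" documents, two "religion" documents.
docs = [
    ["space", "nasa"],
    ["space", "nasa"],
    ["god", "church"],
    ["god", "church"],
]

coherent = npmi_coherence(["space", "nasa"], docs)
incoherent = npmi_coherence(["space", "god"], docs)
print(coherent, incoherent)
```

Scores range from −1 to 1, with higher values meaning the top words of a topic tend to appear in the same documents.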