BERTopic: Neural topic modeling with a class-based TF-IDF procedure
About
Topic models can be useful tools to discover latent topics in collections of documents. Recent studies have shown the feasibility of approaching topic modeling as a clustering task. We present BERTopic, a topic model that extends this process by extracting coherent topic representations through the development of a class-based variation of TF-IDF. More specifically, BERTopic generates document embeddings with pre-trained transformer-based language models, clusters these embeddings, and finally generates topic representations with the class-based TF-IDF procedure. BERTopic generates coherent topics and remains competitive across a variety of benchmarks involving classical models and those that follow the more recent clustering approach to topic modeling.
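The last step of the pipeline, the class-based TF-IDF, can be illustrated with a small sketch. The abstract does not state the weighting formula, so the code below assumes the commonly cited form, tf(t, c) · log(1 + A / tf(t)), where A is the average number of words per class and tf(t) is the frequency of term t across all classes. The embedding and clustering steps are not shown: the `labels` list stands in for cluster assignments, and the whitespace tokenizer, the helper names `class_tfidf` and `top_terms`, and the toy documents are all illustrative choices, not part of BERTopic itself.

```python
import math
from collections import Counter


def class_tfidf(docs, labels):
    """Score terms per cluster with a class-based TF-IDF variant.

    docs   : list of strings (tokenized here by simple whitespace split)
    labels : cluster id per document, assumed to come from an earlier
             embedding + clustering step that is not shown
    """
    # Treat all documents in a cluster as one concatenated "class document".
    class_counts = {}
    for doc, label in zip(docs, labels):
        class_counts.setdefault(label, Counter()).update(doc.lower().split())

    # Frequency of each term across all classes.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)

    # A: average number of words per class.
    avg_words = sum(total_counts.values()) / len(class_counts)

    # Assumed weighting: tf(t, c) * log(1 + A / tf(t)).
    return {
        label: {
            term: tf * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
        for label, counts in class_counts.items()
    }


def top_terms(scores, n=2):
    """Return the n highest-scoring terms per class (ties broken alphabetically)."""
    return {
        label: [t for t, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:n]]
        for label, s in scores.items()
    }


# Toy corpus with hard-coded cluster labels (hypothetical data).
docs = [
    "cat chased the mouse",
    "cat slept on the mat",
    "stock market fell today",
    "the market rallied as stock prices rose",
]
labels = [0, 0, 1, 1]

scores = class_tfidf(docs, labels)
topics = top_terms(scores, n=2)
```

Because the weight is computed per cluster rather than per document, a term that is frequent inside one cluster but rare elsewhere ("cat", "market") is pushed above a term that is spread across clusters ("the"), which is what makes the top-ranked terms usable as a topic description.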
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text Classification | AGNews | Accuracy 66.6 | 119 |
| Text Classification | 20News | Accuracy 59.1 | 101 |
| Topic Modeling | Yelp | -- | 18 |
| Topic Modeling | BBC | NPMI 0.085 | 17 |
| Topic Modeling | AGNews | Diversity 48.7 | 14 |
| Document Retrieval | StackOverflow (test) | Precision@5 30.6 | 11 |
| Topic Modeling | 20NewsGroup | Cv 0.36 | 11 |
| Topic Modeling | TweetTopic | Cv 0.364 | 11 |
| Topic Modeling | StackOverflow | Cv 0.374 | 11 |
| Document Retrieval | TweetTopic (test) | P@5 54.6 | 11 |