# PRISM: PRIor from corpus Statistics for topic Modeling

## About
Topic modeling seeks to uncover latent semantic structure in text, with LDA providing a foundational probabilistic framework. While recent methods often incorporate external knowledge (e.g., pre-trained embeddings), such reliance limits applicability in emerging or underexplored domains. We introduce **PRISM**, a corpus-intrinsic method that derives a Dirichlet prior from word co-occurrence statistics to initialize LDA without altering its generative process. Experiments on text and single-cell RNA-seq data show that PRISM improves topic coherence and interpretability, rivaling models that rely on external knowledge. These results underscore the value of corpus-driven initialization for topic modeling in resource-constrained settings. Code is available at: https://github.com/shaham-lab/PRISM.
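To make the idea concrete, here is a minimal illustrative sketch of deriving an asymmetric Dirichlet prior from corpus co-occurrence statistics. This is **not** the actual PRISM derivation (see the paper and repository for that); the weighting rule, smoothing constant, and scale factor below are hypothetical choices for demonstration only.

```python
import numpy as np

# Toy corpus: each document is a list of vocabulary indices.
docs = [[0, 1, 2], [0, 1, 3], [2, 3, 4], [0, 2, 4]]
V = 5  # vocabulary size

# Symmetric within-document word co-occurrence counts.
C = np.zeros((V, V))
for doc in docs:
    for i in doc:
        for j in doc:
            if i != j:
                C[i, j] += 1

# Hypothetical rule: give each word a Dirichlet weight proportional to
# its total co-occurrence mass, with add-one smoothing, then rescale to
# a typical LDA topic-word prior magnitude (0.01 per word on average).
eta = C.sum(axis=1) + 1.0
eta = eta / eta.sum() * V * 0.01

# `eta` can now seed the topic-word prior of an LDA implementation that
# accepts a per-word (asymmetric) prior, e.g. gensim's LdaModel(eta=...).
print(eta)
```

The key point mirrored here is that the prior is computed purely from corpus-internal statistics, so no pre-trained embeddings or external resources are required, and the LDA generative process itself is left unchanged.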
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Topic Modeling | 20NG | NPMI | 0.1168 | 33 |
| Topic Modeling | DBLP | NPMI | 0.0751 | 23 |
| Topic Modeling | M10 | NPMI | 0.0822 | 23 |
| Topic Modeling | TrumpTweets | Coherence Value (Cv) | 0.5571 | 10 |
| Topic Modeling | BBC News | Coherence Value (Cv) | 0.6781 | 10 |
| Word Intrusion Detection | 20NewsGroup | Accuracy | 60.99 | 10 |
| Word Intrusion Detection | BBC | Accuracy | 62.01 | 10 |
| Word Intrusion Detection | M10 | Accuracy | 39.21 | 10 |
| Word Intrusion Detection | DBLP | Accuracy | 34.08 | 10 |
| Word Intrusion Detection | TrumpTweets | Accuracy | 30.57 | 10 |