# PRISM: PRIor from corpus Statistics for topic Modeling

## About
Topic modeling seeks to uncover latent semantic structure in text, with LDA providing a foundational probabilistic framework. While recent methods often incorporate external knowledge (e.g., pre-trained embeddings), such reliance limits applicability in emerging or underexplored domains. We introduce **PRISM**, a corpus-intrinsic method that derives a Dirichlet prior from word co-occurrence statistics to initialize LDA without altering its generative process. Experiments on text and single-cell RNA-seq data show that PRISM improves topic coherence and interpretability, rivaling models that rely on external knowledge. These results underscore the value of corpus-driven initialization for topic modeling in resource-constrained settings. Code is available at: https://github.com/shaham-lab/PRISM.
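To make the idea concrete, here is a minimal illustrative sketch of deriving an asymmetric Dirichlet prior from corpus co-occurrence statistics. This is **not** the actual PRISM derivation (see the paper and repository for that); the weighting rule, smoothing constant, and scale factor below are hypothetical choices for demonstration only.

```python
import numpy as np

# Toy corpus: each document is a list of vocabulary indices.
docs = [[0, 1, 2], [0, 1, 3], [2, 3, 4], [0, 2, 4]]
V = 5  # vocabulary size

# Symmetric within-document word co-occurrence counts.
C = np.zeros((V, V))
for doc in docs:
    for i in doc:
        for j in doc:
            if i != j:
                C[i, j] += 1

# Hypothetical rule: give each word a Dirichlet weight proportional to
# its total co-occurrence mass, with add-one smoothing, then rescale to
# a typical LDA topic-word prior magnitude (0.01 per word on average).
eta = C.sum(axis=1) + 1.0
eta = eta / eta.sum() * V * 0.01

# `eta` can now seed the topic-word prior of an LDA implementation that
# accepts a per-word (asymmetric) prior, e.g. gensim's LdaModel(eta=...).
print(eta)
```

The key point mirrored here is that the prior is computed purely from corpus-internal statistics, so no pre-trained embeddings or external resources are required, and the LDA generative process itself is left unchanged.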
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Topic Modeling | 20NG | NPMI | 0.1168 | 33 |
| Topic Modeling | DBLP | NPMI | 0.0751 | 23 |
| Topic Modeling | M10 | NPMI | 0.0822 | 23 |
| Topic Modeling | TrumpTweets | Coherence Value (Cv) | 0.5571 | 10 |
| Topic Modeling | BBC News | Coherence Value (Cv) | 0.6781 | 10 |
| Word Intrusion Detection | 20NewsGroup | Accuracy | 60.99 | 10 |
| Word Intrusion Detection | BBC | Accuracy | 62.01 | 10 |
| Word Intrusion Detection | M10 | Accuracy | 39.21 | 10 |
| Word Intrusion Detection | DBLP | Accuracy | 34.08 | 10 |
| Word Intrusion Detection | TrumpTweets | Accuracy | 30.57 | 10 |