
PRISM: PRIor from corpus Statistics for topic Modeling

About

Topic modeling seeks to uncover latent semantic structure in text, with LDA providing a foundational probabilistic framework. While recent methods often incorporate external knowledge (e.g., pre-trained embeddings), such reliance limits applicability in emerging or underexplored domains. We introduce PRISM, a corpus-intrinsic method that derives a Dirichlet parameter from word co-occurrence statistics to initialize LDA without altering its generative process. Experiments on text and single-cell RNA-seq data show that PRISM improves topic coherence and interpretability, rivaling models that rely on external knowledge. These results underscore the value of corpus-driven initialization for topic modeling in resource-constrained settings. Code is available at: https://github.com/shaham-lab/PRISM.
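The central idea, deriving a Dirichlet concentration vector from corpus co-occurrence statistics and feeding it to LDA as a prior, can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's exact recipe: the function name `cooccurrence_prior` and the specific weighting scheme are assumptions; see the linked repository for the actual method.

```python
import numpy as np

def cooccurrence_prior(docs, vocab_size, base=0.1):
    """Illustrative sketch: build an asymmetric Dirichlet concentration
    vector from document-level word co-occurrence counts. The result
    could seed LDA's word prior (eta) instead of a flat symmetric value.
    The weighting below is an assumption, not the paper's formula."""
    # Count how often each pair of word ids co-occurs within a document.
    C = np.zeros((vocab_size, vocab_size))
    for doc in docs:
        ids = np.unique(doc)
        for i in ids:
            for j in ids:
                if i != j:
                    C[i, j] += 1.0
    # A word's total co-occurrence mass acts as a pseudo-count signal:
    # words that co-occur more broadly receive a larger concentration.
    mass = C.sum(axis=1)
    eta = base * (1.0 + mass / max(mass.sum(), 1.0))
    return eta

# Toy corpus of word-id documents over a 4-word vocabulary.
docs = [[0, 1, 2], [1, 2, 3], [0, 3]]
eta = cooccurrence_prior(docs, vocab_size=4)
```

Because the output is just a vector of positive pseudo-counts, it plugs into any standard LDA implementation that accepts an asymmetric prior (e.g., an `eta` array), leaving the generative process itself unchanged, as the abstract emphasizes.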

Tal Ishon, Yoav Goldberg, Uri Shaham • 2026

Related benchmarks

Task                      Dataset       Metric                 Result   Rank
Topic Modeling            20NG          NPMI                   0.1168   33
Topic Modeling            DBLP          NPMI                   0.0751   23
Topic Modeling            M10           NPMI                   0.0822   23
Topic Modeling            TrumpTweets   Coherence Value (Cv)   0.5571   10
Word Intrusion Detection  20NewsGroup   Accuracy               60.99    10
Topic Modeling            BBC News      Coherence Value (Cv)   0.6781   10
Word Intrusion Detection  BBC           Accuracy               62.01    10
Word Intrusion Detection  M10           Accuracy               39.21    10
Word Intrusion Detection  DBLP          Accuracy               34.08    10
Word Intrusion Detection  TrumpTweets   Accuracy               30.57    10
Showing 10 of 16 rows
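Most of the topic-modeling rows above report NPMI (normalized pointwise mutual information), which scores a topic by how strongly its top words co-occur, normalized to [-1, 1]. A minimal document-level sketch follows; real evaluations typically use sliding windows, smoothing, and an external reference corpus, so treat this as a simplified illustration rather than the benchmark's exact scorer.

```python
import numpy as np
from itertools import combinations

def topic_npmi(topic_words, docs, eps=1e-12):
    """Simplified NPMI coherence: average NPMI over all pairs of a
    topic's top words, using document-level co-occurrence probabilities.
    NPMI(wi, wj) = log(p_ij / (p_i * p_j)) / (-log p_ij)."""
    n = len(docs)
    doc_sets = [set(d) for d in docs]
    def p(*words):
        # Fraction of documents containing all the given words.
        return sum(all(w in s for w in words) for s in doc_sets) / n
    scores = []
    for wi, wj in combinations(topic_words, 2):
        p_ij = p(wi, wj)
        if p_ij <= 0:
            scores.append(-1.0)  # convention: never-co-occurring pair
            continue
        pmi = np.log(p_ij / (p(wi) * p(wj) + eps))
        scores.append(pmi / (-np.log(p_ij + eps)))
    return float(np.mean(scores))

# Toy example: two topic words evaluated over four tiny documents.
score = topic_npmi(["a", "b"],
                   [["a", "b"], ["a", "b"], ["a", "c"], ["b", "c"]])
```

Higher is better: a topic whose top words always appear together approaches 1, while independent words score near 0 and disjoint words approach -1, which is why the table's small positive NPMI values are typical for these datasets.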
