Sparse Autoencoders for Hypothesis Generation

About

We describe HypotheSAEs, a general method to hypothesize interpretable relationships between text data (e.g., headlines) and a target variable (e.g., clicks). HypotheSAEs has three steps: (1) train a sparse autoencoder on text embeddings to produce interpretable features describing the data distribution, (2) select features that predict the target variable, and (3) generate a natural language interpretation of each feature (e.g., "mentions being surprised or shocked") using an LLM. Each interpretation serves as a hypothesis about what predicts the target variable. Compared to baselines, our method better identifies reference hypotheses on synthetic datasets (at least +0.06 in F1) and produces more predictive hypotheses on real datasets (~twice as many significant findings), despite requiring 1-2 orders of magnitude less compute than recent LLM-based methods. HypotheSAEs also produces novel discoveries on two well-studied tasks: explaining partisan differences in Congressional speeches and identifying drivers of engagement with online headlines.
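Step (2) above, selecting features that predict the target variable, can be illustrated with a minimal sketch. It assumes the sparse autoencoder activations from step (1) are already available as a matrix with one row per text, and it uses absolute Pearson correlation as a simple stand-in selection criterion. The function names, the toy data, and the choice of criterion are assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def select_features(activations, target, k):
    """Return the k feature indices with the largest |correlation| with target.

    activations: list of rows, one per text, each a list of feature activations.
    target: list of target values, one per text.
    """
    n_features = len(activations[0])
    scored = []
    for j in range(n_features):
        column = [row[j] for row in activations]
        scored.append((abs(pearson(column, target)), j))
    # Sort by descending score; break ties by feature index for determinism.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [j for _, j in scored[:k]]


# Hypothetical toy data: 200 texts, 6 sparse features (most activations are 0).
rng = random.Random(0)
activations = [
    [rng.random() if rng.random() < 0.2 else 0.0 for _ in range(6)]
    for _ in range(200)
]
# Synthetic target driven by features 1 and 4, plus a little noise.
target = [2.0 * row[1] - 1.5 * row[4] + rng.gauss(0, 0.05) for row in activations]

selected = select_features(activations, target, k=2)
print(sorted(selected))
```

In the full method, each selected feature would then be passed to an LLM for a natural language interpretation (step 3), which this sketch does not attempt.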

Rajiv Movva, Kenny Peng, Nikhil Garg, Jon Kleinberg, Emma Pierson • 2025

Related benchmarks

Task                 | Dataset           | Result                                | Rank
Signal Recovery      | Synthetic dataset | SURF Score: 90                        | 5
Hypothesis Discovery | HypoBench         | Deception Score: 8                    | 3
Hypothesis Discovery | Twitter           | Significant Hypotheses Count: 12      | 3
Hypothesis Discovery | Design            | Significant Hypotheses Found: 6       | 3
Hypothesis Discovery | Congress          | Significant Hypotheses Found: 12      | 3
Hypothesis Discovery | CMV               | Significant Hypotheses Found: 8       | 3
Hypothesis Discovery | LaMem             | Significant Hypotheses: 0.00e+0       | 3