Sample, estimate, aggregate: A recipe for causal discovery foundation models

About

Causal discovery, the task of inferring causal structure from data, has the potential to uncover mechanistic insights from biological experiments, especially those involving perturbations. However, causal discovery algorithms over larger sets of variables tend to be brittle against misspecification or when data are limited. For example, single-cell transcriptomics measures thousands of genes, but the nature of their relationships is not known, and there may be as few as tens of cells per intervention setting. To mitigate these challenges, we propose a foundation model-inspired approach: a supervised model trained on large-scale, synthetic data to predict causal graphs from summary statistics -- like the outputs of classical causal discovery algorithms run over subsets of variables and other statistical hints like inverse covariance. Our approach is enabled by the observation that typical errors in the outputs of a discovery algorithm remain comparable across datasets. Theoretically, we show that the model architecture is well-specified, in the sense that it can recover a causal graph consistent with graphs over subsets. Empirically, we train the model to be robust to misspecification and distribution shift using diverse datasets. Experiments on biological and synthetic data confirm that this model generalizes well beyond its training set, runs on graphs with hundreds of variables in seconds, and can be easily adapted to different underlying data assumptions.

Menghua Wu, Yujia Bao, Regina Barzilay, Tommi Jaakkola • 2024
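The sample, estimate, aggregate recipe described in the abstract can be illustrated with a small sketch. This is not the authors' implementation, and every function name here (`inverse_covariance`, `estimate_subset`, `sample_estimate_aggregate`) is hypothetical. Two simplifying assumptions are made: thresholded partial correlation stands in for a classical causal discovery algorithm run on each variable subset, and a plain average of per-subset votes stands in for the supervised aggregation model, which in the paper is trained on synthetic data. The output is therefore an undirected edge score, not a causal graph.

```python
import numpy as np


def inverse_covariance(data, ridge=1e-3):
    """Ridge-regularized inverse covariance of data (samples x variables)."""
    cov = np.atleast_2d(np.cov(data, rowvar=False))
    return np.linalg.inv(cov + ridge * np.eye(cov.shape[0]))


def estimate_subset(data, subset, threshold=0.1):
    """Stand-in estimator (assumption): thresholded partial correlations.

    A real pipeline would call a classical causal discovery algorithm here.
    """
    prec = inverse_covariance(data[:, subset])
    scale = np.sqrt(np.diag(prec))
    partial_corr = -prec / np.outer(scale, scale)
    adj = (np.abs(partial_corr) > threshold).astype(float)
    np.fill_diagonal(adj, 0.0)
    return adj


def sample_estimate_aggregate(data, subset_size=3, num_subsets=100, seed=0):
    """Sample variable subsets, estimate a local graph on each, then aggregate.

    Aggregation here is a simple vote average (assumption); the paper
    instead trains a model to predict the graph from these statistics.
    """
    rng = np.random.default_rng(seed)
    n_vars = data.shape[1]
    votes = np.zeros((n_vars, n_vars))
    counts = np.zeros((n_vars, n_vars))
    for _ in range(num_subsets):
        subset = rng.choice(n_vars, size=subset_size, replace=False)
        idx = np.ix_(subset, subset)
        votes[idx] += estimate_subset(data, subset)
        counts[idx] += 1.0
    # Fraction of subsets containing both variables that reported an edge.
    scores = np.divide(votes, counts, out=np.zeros_like(votes), where=counts > 0)
    # Global inverse covariance is returned as an extra summary statistic.
    return scores, inverse_covariance(data)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.normal(size=(2000, 6))
    x[:, 1] = 2.0 * x[:, 0] + 0.5 * rng.normal(size=2000)  # x0 -> x1
    scores, precision = sample_estimate_aggregate(x)
    print(np.round(scores, 2))
```

In this sketch, both the per-subset estimates and the global inverse covariance are the kind of summary statistics that the abstract says a trained model would take as input in place of the simple averaging step.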

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Causal Discovery | Synthetic (n=100, \|E\|=400, sample size=1000) | mAP | 92.1 | 36 |
| Causal Discovery | Synthetic n=1000, \|E\|=2000, sample size=1000 | mAP | 66.3 | 32 |
| Causal Discovery | Synthetic Data Observation-only (1000 samples) | Rank | 5.2 | 15 |
| Causal Discovery | Semantic Causal Environment observation-only | F1 Score | 47.1 | 15 |
| Causal Discovery | Synthetic graphs N=50, E=50 | F1 | 3.9 | 13 |
| Causal Discovery | SACHS p = 11, s = 20, n = 100 (real flow cytometry) | F1 Score | 16 | 13 |
| Causal Discovery | Synthetic graphs N=100, E=100 | SHD | 1.80e+3 | 10 |
| Causal Discovery | Synthetic Data Mixed-interventional (800 observational + 200 interventional samples) | Rank | 5 | 9 |
| Causal Discovery | Semantic Causal Environment mixed-interventional | F1 Score | 32.1 | 9 |
| Causal Discovery | Synthetic graphs (N=20, E=20) | Structural Hamming Distance (SHD) | 21.9 | 7 |
Showing 10 of 23 rows
