Directed Graphical Models and Causal Discovery for Zero-Inflated Data
About
Modern RNA sequencing technologies provide gene expression measurements from single cells that promise refined insights into regulatory relationships among genes. Directed graphical models are well-suited to explore such (cause-effect) relationships. However, statistical analyses of single-cell data are complicated by the fact that the data often show zero-inflated expression patterns. To address this challenge, we propose directed graphical models that are based on Hurdle conditional distributions parametrized in terms of polynomials in parent variables and their 0/1 indicators of being zero or nonzero. While directed graphs for Gaussian models are only identifiable up to an equivalence class in general, we show that, under a natural and weak assumption, the exact directed acyclic graph of our zero-inflated models can be identified. We propose methods for graph recovery, apply our model to real single-cell RNA-seq data on T helper cells, and present simulation experiments that validate the identifiability and graph estimation methods in practice.
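To make the idea of a Hurdle conditional distribution concrete, the following is a minimal sketch of sampling from a two-node graph X -> Y. It is an illustration under stated assumptions, not the parametrization used in the paper: the polynomial is taken to be of degree one, the nonzero part is assumed Gaussian, and the function names and all coefficient values (`a`, `b`, `sigma`, `p_x_nonzero`) are hypothetical.

```python
import math
import random


def sample_hurdle_child(x, rng, a=(-0.5, 1.0, 0.8), b=(1.0, 0.5, 0.7), sigma=1.0):
    """Draw Y given one parent value x from a toy hurdle conditional.

    Assumption: both the zero/nonzero probability and the mean of the
    nonzero part are degree-one polynomials in x and its 0/1 indicator.
    The coefficients a, b and sigma are made up for illustration.
    """
    v = 1.0 if x != 0 else 0.0  # 0/1 indicator of the parent being nonzero
    # Hurdle part: logistic probability that Y is nonzero.
    logit = a[0] + a[1] * v + a[2] * x
    p_nonzero = 1.0 / (1.0 + math.exp(-logit))
    if rng.random() >= p_nonzero:
        return 0.0  # Y falls on the point mass at zero
    # Continuous part: Gaussian whose mean depends on x and its indicator.
    mean = b[0] + b[1] * v + b[2] * x
    return rng.gauss(mean, sigma)


def sample_pair(rng, p_x_nonzero=0.6):
    """Sample (X, Y) from the two-node DAG X -> Y (hypothetical settings)."""
    x = rng.gauss(1.5, 1.0) if rng.random() < p_x_nonzero else 0.0
    y = sample_hurdle_child(x, rng)
    return x, y


rng = random.Random(0)
samples = [sample_pair(rng) for _ in range(5000)]
zero_frac_y = sum(1 for _, y in samples if y == 0.0) / len(samples)
print(f"fraction of exact zeros in Y: {zero_frac_y:.3f}")
```

In this sketch the child is zero far more often when its parent is zero, which is the kind of dependence that the 0/1 indicators are meant to capture alongside the dependence on the parent's actual value.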
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| DAG structure learning | Simulated zero-inflated count data, ER graph, D=50 (test) | TPR | 0.016 | 11 |
| DAG structure learning | Simulated zero-inflated count data, BA graph, D=50 (test) | TPR | 4.5 | 11 |