Discovery of a Hematopoietic Manifold in scGPT Yields a Method for Extracting Performant Algorithms from Biological Foundation Model Internals
About
We report the discovery and extraction of a compact hematopoietic algorithm from the single-cell foundation model scGPT, to our knowledge the first biologically useful, competitive algorithm extracted from a foundation model via mechanistic interpretability. We show that scGPT internally encodes a compact hematopoietic manifold with significant developmental branch structure, validated on a strict non-overlap Tabula Sapiens external panel and confirmed via frozen-head zero-shot transfer to an independent multi-donor immune panel. To isolate this geometry, we introduce a general three-stage extraction method consisting of direct operator export from frozen attention weights, a lightweight learned adaptor, and a task-specific readout, producing a standalone algorithm without target-dataset retraining. In 88-split donor-holdout benchmarks against scVI, Palantir, DPT, CellTypist, PCA, and raw-expression baselines, the extracted algorithm achieves the strongest pseudotime-depth ordering and leads on key subtype endpoints (CD4/CD8 AUROC 0.867, mono/macro AUROC 0.951). Compared to standard probing of frozen scGPT embeddings with a 3-layer MLP, the extracted head is BH-significantly better on 6/8 classification endpoints while completing a full 12-split evaluation campaign 34.5x faster with approximately 1000x fewer trainable parameters. The exported operator compresses from three pooled attention heads to a single head without statistically significant loss, and further to a rank-64 surrogate. Mechanistic interpretability of the compact operator reveals a concentrated four-factor core explaining 66.2% of ablation impact, with factors resolving into explicit T/lymphoid, B/plasma, granulocytic, and monocyte/macrophage gene programs. A supplementary second-manifold validation (intercellular communication geometry) confirms that the extraction method generalizes beyond hematopoiesis.
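The abstract does not say how the rank-64 surrogate is built. As a minimal sketch only, assuming a truncated SVD (the standard way to get the best fixed-rank approximation in Frobenius norm), the compression step could look like the following. The matrix `operator` is a synthetic stand-in, and `low_rank_surrogate` is a hypothetical helper name; neither comes from the paper.

```python
import numpy as np


def low_rank_surrogate(operator: np.ndarray, rank: int) -> np.ndarray:
    """Return the best rank-`rank` approximation of `operator` (Frobenius norm)
    using a truncated SVD. Assumption: the paper's surrogate may be built differently."""
    u, s, vt = np.linalg.svd(operator, full_matrices=False)
    # Keep only the leading `rank` singular triplets.
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]


rng = np.random.default_rng(0)

# Synthetic stand-in for an exported operator: approximately rank-32 plus small noise.
d = 256
operator = rng.standard_normal((d, 32)) @ rng.standard_normal((32, d))
operator += 0.01 * rng.standard_normal((d, d))

surrogate = low_rank_surrogate(operator, rank=64)

# Relative reconstruction error of the surrogate against the full operator.
rel_err = np.linalg.norm(operator - surrogate) / np.linalg.norm(operator)
print(f"relative Frobenius error at rank 64: {rel_err:.2e}")
```

On a real exported operator, the error to watch is not the reconstruction norm but the change in downstream endpoints, which is what the abstract reports when it says the compression incurs no statistically significant loss.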
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Branch classification | Tabula Sapiens Large-cell Robust-V2 (pooled donor-holdout) | Branch Balanced Accuracy | 82.8 | 10 |
| Cell Type Discrimination | Tabula Sapiens Large-cell Robust-V2 (pooled donor-holdout) | CD4/CD8 AUROC | 0.867 | 10 |
| Pseudotime Estimation | Tabula Sapiens Large-cell Robust-V2 (pooled donor-holdout) | Pseudotime Spearman Correlation | 0.249 | 10 |
| Stage classification | Tabula Sapiens Large-cell Robust-V2 (pooled donor-holdout) | Stage Balanced Accuracy | 55.2 | 10 |
| Pseudotime Estimation | Robust V2 (pooled donor-holdout) | Mean Difference | -0.092 | 9 |
| Unsupervised branch-inference | Canonical (100k) | ARI | 61.3 | 8 |
| Unsupervised branch-inference | Canonical 4 held-out donors (100k split) | ARI | 0.387 | 8 |
| Stage classification | Robust cells V2 (test) | Balanced Accuracy | 55.2 | 4 |
| Branch classification | Robust V2 (test) | Balanced Accuracy | 82.8 | 2 |
| CD4/CD8 identification | Robust cells V2 (test) | AUROC | 0.867 | 2 |