Discovery of a Hematopoietic Manifold in scGPT Yields a Method for Extracting Performant Algorithms from Biological Foundation Model Internals

About

We report the discovery and extraction of a compact hematopoietic algorithm from the single-cell foundation model scGPT, to our knowledge the first biologically useful, competitive algorithm extracted from a foundation model via mechanistic interpretability. We show that scGPT internally encodes a compact hematopoietic manifold with significant developmental branch structure, validated on a strict non-overlap Tabula Sapiens external panel and confirmed via frozen-head zero-shot transfer to an independent multi-donor immune panel. To isolate this geometry, we introduce a general three-stage extraction method consisting of direct operator export from frozen attention weights, a lightweight learned adaptor, and a task-specific readout, producing a standalone algorithm without target-dataset retraining. In 88-split donor-holdout benchmarks against scVI, Palantir, DPT, CellTypist, PCA, and raw-expression baselines, the extracted algorithm achieves the strongest pseudotime-depth ordering and leads on key subtype endpoints (CD4/CD8 AUROC 0.867, mono/macro AUROC 0.951). Compared to standard probing of frozen scGPT embeddings with a 3-layer MLP, the extracted head is BH-significantly better on 6/8 classification endpoints while completing a full 12-split evaluation campaign 34.5x faster with approximately 1000x fewer trainable parameters. The exported operator compresses from three pooled attention heads to a single head without statistically significant loss, and further to a rank-64 surrogate. Mechanistic interpretability of the compact operator reveals a concentrated four-factor core explaining 66.2% of ablation impact, with factors resolving into explicit T/lymphoid, B/plasma, granulocytic, and monocyte/macrophage gene programs. A supplementary second-manifold validation (intercellular communication geometry) confirms that the extraction method generalizes beyond hematopoiesis.
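The compression of the exported operator to a rank-64 surrogate can be illustrated with a truncated SVD, which gives the best rank-k approximation of a matrix in Frobenius norm. This is a minimal sketch under stated assumptions, not the paper's procedure: the abstract does not say how the surrogate is built, and the matrix below is synthetic random data standing in for an exported attention operator. The names `low_rank_surrogate`, `operator`, and the size `d = 256` are hypothetical.

```python
import numpy as np


def low_rank_surrogate(operator: np.ndarray, rank: int) -> np.ndarray:
    """Return the best rank-`rank` approximation of `operator` (Frobenius norm)."""
    # Thin SVD: operator = u @ diag(s) @ vt, singular values sorted descending.
    u, s, vt = np.linalg.svd(operator, full_matrices=False)
    # Keep only the leading `rank` singular triplets.
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]


rng = np.random.default_rng(0)
d = 256  # hypothetical operator size, not taken from the paper

# Synthetic stand-in for an exported operator: a rank-32 signal plus small noise.
operator = (
    rng.standard_normal((d, 32)) @ rng.standard_normal((32, d))
    + 0.01 * rng.standard_normal((d, d))
)

surrogate = low_rank_surrogate(operator, rank=64)

# Relative Frobenius error of the surrogate against the full operator.
rel_error = np.linalg.norm(operator - surrogate) / np.linalg.norm(operator)
print(f"relative error of rank-64 surrogate: {rel_error:.4f}")
```

On a real operator, the relevant check would be whether downstream endpoints (for example branch balanced accuracy) hold up after the swap, not just the reconstruction error shown here.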

Ihor Kendiukhov • 2026

Related benchmarks

Task	Dataset	Metric	Value
Branch classification	Tabula Sapiens Large-cell Robust-V2 (pooled donor-holdout)	Branch Balanced Accuracy	82.8	10
Cell Type Discrimination	Tabula Sapiens Large-cell Robust-V2 (pooled donor-holdout)	CD4/CD8 AUROC	0.867	10
Pseudotime Estimation	Tabula Sapiens Large-cell Robust-V2 (pooled donor-holdout)	Pseudotime Spearman Correlation	0.249	10
Stage classification	Tabula Sapiens Large-cell Robust-V2 (pooled donor-holdout)	Stage Balanced Accuracy	55.2	10
Pseudotime Estimation	Robust V2 (pooled donor-holdout)	Mean Difference	-0.092	9
Unsupervised branch-inference	Canonical (100k)	ARI	61.3	8
Unsupervised branch-inference	Canonical 4 held-out donors (100k split)	ARI	0.387	8
Stage classification	Robust cells V2 (test)	Balanced Accuracy	55.2	4
Branch classification	Robust V2 (test)	Balanced Acc	82.8	2
CD4/CD8 identification	Robust cells V2 (test)	AUROC	0.867	2

Showing 10 of 11 rows
