
Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling

About

Large-scale sequence modeling has sparked rapid advances that now extend into biology and genomics. However, modeling genomic sequences introduces challenges such as the need to model long-range token interactions, the effects of upstream and downstream regions of the genome, and the reverse complementarity (RC) of DNA. Here, we propose an architecture motivated by these challenges that builds off the long-range Mamba block, and extends it to a BiMamba component that supports bi-directionality, and to a MambaDNA block that additionally supports RC equivariance. We use MambaDNA as the basis of Caduceus, the first family of RC equivariant bi-directional long-range DNA language models, and we introduce pre-training and fine-tuning strategies that yield Caduceus DNA foundation models. Caduceus outperforms previous long-range models on downstream benchmarks; on a challenging long-range variant effect prediction task, Caduceus exceeds the performance of 10x larger models that do not leverage bi-directionality or equivariance.
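To make the RC equivariance property concrete, here is a minimal sketch in plain Python. It is not the Caduceus implementation. It assumes one common way to obtain equivariance through parameter sharing: split the channels in half, run the same module on the first half and on the reverse complement of the second half, then undo the RC on the second output. The names `toy_block` and `rc_equivariant` are hypothetical, and `toy_block` is a stand-in causal module, not a Mamba block.

```python
import math
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string: reverse the order, swap A<->T and C<->G."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def rc(x):
    """RC on a feature array (list of rows, one row per position).

    Assumption: channels are laid out so that reversing the channel axis
    plays the role of complementing, as it does for one-hot A, C, G, T.
    """
    return [row[::-1] for row in reversed(x)]


def toy_block(x, weights):
    """Hypothetical stand-in for a causal sequence module (not Mamba):
    a running sum over positions followed by a linear map and tanh."""
    d = len(x[0])
    state = [0.0] * d
    out = []
    for row in x:
        state = [s + v for s, v in zip(state, row)]
        out.append([
            math.tanh(sum(state[i] * weights[i][j] for i in range(d)))
            for j in range(d)
        ])
    return out


def rc_equivariant(x, weights):
    """Wrap toy_block so the result commutes with rc().

    The same weights process the first channel half directly and the
    second half in reverse-complemented form.
    """
    half = len(x[0]) // 2
    first = [row[:half] for row in x]
    second = [row[half:] for row in x]
    y_first = toy_block(first, weights)
    y_second = rc(toy_block(rc(second), weights))
    return [a + b for a, b in zip(y_first, y_second)]


if __name__ == "__main__":
    random.seed(0)
    x = [[random.random() for _ in range(4)] for _ in range(6)]  # 6 positions, 4 channels
    w = [[random.random() for _ in range(2)] for _ in range(2)]  # shared weights for each half
    lhs = rc_equivariant(rc(x), w)   # transform the input, then apply the block
    rhs = rc(rc_equivariant(x, w))   # apply the block, then transform the output
    print(all(math.isclose(a, b) for r1, r2 in zip(lhs, rhs) for a, b in zip(r1, r2)))
```

Under these assumptions, applying the wrapped block to the reverse complement of an input gives the reverse complement of the original output, which is what equivariance means here. Calling `toy_block` on its own does not have this property, because it only accumulates information in one direction.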

Yair Schiff, Chia-Hsiang Kao, Aaron Gokaslan, Tri Dao, Albert Gu, Volodymyr Kuleshov • 2024

Related benchmarks

Task | Dataset | Result | Rank
Gene Expression CAGE Prediction | K562 | MSE 0.1959 | 10
Gene Expression CAGE Prediction | GM12878 | MSE 0.1942 | 10
Classification | Genomic Benchmarks | Mouse Enhancers Accuracy 81 | 5
Histone mark prediction | Nucleotide Transformer benchmark | H3 Accuracy 80.48 | 5
Regulatory element prediction | Nucleotide Transformer benchmark | Enhancer Accuracy 55.2 | 5
Variant Effect Prediction | Human SNP 0–30k distance-to-TSS bin | AUROC 0.678 | 5
Variant Effect Prediction | Human SNP (100k+ distance-to-TSS bin) | AUROC 58 | 5
Variant Effect Prediction | Human SNP 30–100k distance-to-TSS bin | AUROC 0.648 | 5
Splice site identification | Nucleotide Transformer benchmark | Splice Acceptor Accuracy 94.21 | 5
Sequence Classification | Genomic Benchmarks Human vs. Worm (test) | Top-1 Accuracy 97.3 | 4
Showing 10 of 18 rows
