DNABERT-S: Pioneering Species Differentiation with Species-Aware DNA Embeddings

About

We introduce DNABERT-S, a tailored genome model that develops species-aware embeddings to naturally cluster and segregate DNA sequences of different species in the embedding space. Differentiating species from genomic sequences (i.e., DNA and RNA) is vital yet challenging, since many real-world species remain uncharacterized and lack known reference genomes. Embedding-based methods are therefore used to differentiate species in an unsupervised manner. DNABERT-S builds upon a pre-trained genome foundation model named DNABERT-2. To encourage effective embeddings for error-prone long-read DNA sequences, we introduce Manifold Instance Mixup (MI-Mix), a contrastive objective that mixes the hidden representations of DNA sequences at randomly selected layers and trains the model to recognize and differentiate these mixed proportions at the output layer. We further enhance it with the proposed Curriculum Contrastive Learning (C²LR) strategy. Empirical results on 23 diverse datasets show DNABERT-S's effectiveness, especially in realistic label-scarce scenarios. For example, it identifies twice as many species from a mixture of unlabeled genomic sequences, doubles the Adjusted Rand Index (ARI) in species clustering, and outperforms the top baseline's 10-shot species classification performance with just 2-shot training. Model, code, and data are publicly available at https://github.com/MAGICS-LAB/DNABERT_S.
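To make the MI-Mix idea more concrete, the following is a minimal sketch of a mixup-style contrastive loss, not the authors' implementation. It assumes that each hidden vector is mixed with a partner chosen by a permutation using a single mixing ratio `lam`, and that the loss is a cross-entropy against a soft target that puts `lam` on the anchor's own positive and `1 - lam` on the partner's positive. The function names, the temperature value, and the use of plain lists instead of tensors are all illustrative assumptions; in a real model the mixing would happen inside a randomly chosen layer and the remaining layers would process the mixed state.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]


def mix_hidden(hidden, perm, lam):
    """Mix each hidden vector with the partner selected by perm.

    Hypothetical helper: stands in for mixing at a randomly selected layer.
    """
    return [
        [lam * a + (1.0 - lam) * b for a, b in zip(hidden[i], hidden[perm[i]])]
        for i in range(len(hidden))
    ]


def mixed_contrastive_loss(mixed_anchors, positives, perm, lam, temperature=0.05):
    """Cross-entropy over cosine similarities with a soft (mixed) target.

    Anchor i should match positive i with weight lam and positive perm[i]
    with weight 1 - lam, so the model is asked to recover the mixing ratio.
    """
    anchors = [normalize(v) for v in mixed_anchors]
    pos = [normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, p) / temperature for p in pos]
        # Numerically stable log-sum-exp over all candidates in the batch.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(x - top) for x in logits))
        own = logits[i] - log_z
        partner = logits[perm[i]] - log_z
        total -= lam * own + (1.0 - lam) * partner
    return total / len(anchors)
```

As a sanity check under these assumptions, three orthogonal toy embeddings mixed with `lam = 0.5` give a loss close to ln 2, because each mixed anchor is equally similar to its own positive and to its partner's positive, while `lam = 1.0` (no mixing) gives a loss close to zero.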

Zhihan Zhou, Weimin Wu, Harrison Ho, Jiayi Wang, Lizhen Shi, Ramana V Davuluri, Zhong Wang, Han Liu • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Species-level classification | DNA barcodes seen species | Accuracy | 0.997 | 15 |
| Genus-level 1-NN probe | DNA barcodes (unseen species) | Accuracy | 30.6 | 9 |
| Species Classification | DNA Barcodes seen species split | Weighted F1 | 99.74 | 8 |
| Genus Classification | DNA Barcodes genus-level | Weighted F1 | 48 | 8 |
| BIN reconstruction | DNA barcodes | Accuracy | 62.8 | 6 |
| Species Classification | DNA Barcodes species-level | Weighted F1 | 96.85 | 6 |
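
Several rows above report Weighted F1 rather than accuracy. As a rough illustration of how that metric is usually defined (per-class F1 averaged with weights proportional to each class's number of true examples), here is a small stdlib-only sketch. It is not the evaluation code behind these benchmarks, and the function name and toy labels are hypothetical.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n_true * f1
    return total / len(y_true)


# Toy labels (hypothetical): three sequences of genus "a", one of genus "b".
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
score = weighted_f1(y_true, y_pred)
```

On imbalanced label sets, which are common when some species or genera have few sequences, this weighting keeps frequent classes from being under-counted while still penalizing both false positives and false negatives per class.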