BarcodeBERT: Transformers for Biodiversity Analysis

About

In the global challenge of understanding and characterizing biodiversity, short species-specific genomic sequences known as DNA barcodes play a critical role, enabling fine-grained comparisons among organisms within the same kingdom of life. Although machine learning algorithms specifically designed for the analysis of DNA barcodes are becoming more popular, most existing methodologies rely on generic supervised training algorithms. We introduce BarcodeBERT, a family of models tailored to biodiversity analysis and trained exclusively on data from a reference library of 1.5M invertebrate DNA barcodes. We compared the performance of BarcodeBERT on taxonomic identification tasks against a spectrum of machine learning approaches including supervised training of classical neural architectures and fine-tuning of general DNA foundation models. Our self-supervised pretraining strategies on domain-specific data outperform fine-tuned foundation models, especially in identification tasks involving lower taxa such as genera and species. We also compared BarcodeBERT with BLAST, one of the most widely used bioinformatics tools for sequence searching, and found that our method matched BLAST's performance in species-level classification while being 55 times faster. Our analysis of masking and tokenization strategies also provides practical guidance for building customized DNA language models, emphasizing the importance of aligning model training strategies with dataset characteristics and domain knowledge. The code repository is available at https://github.com/bioscan-ml/BarcodeBERT.
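The abstract refers to tokenization and masking strategies without detailing them here. As a rough illustration only, the sketch below shows one common setup for DNA language models: splitting a sequence into non-overlapping k-mers and masking a random subset of tokens for a masked-token prediction objective. The function names, the value of k, the mask ratio, and the "[MASK]" string are all assumptions made for this example and are not taken from the BarcodeBERT repository.

```python
import random


def kmer_tokenize(seq, k=4):
    """Split a DNA sequence into non-overlapping k-mers.

    A trailing fragment shorter than k is dropped (an assumption of this sketch).
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def mask_tokens(tokens, mask_ratio=0.5, mask_token="[MASK]", seed=0):
    """Replace a random subset of tokens with a mask token.

    Returns the masked token list and a dict mapping each masked
    position to the original token the model would have to predict.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    n_mask = max(1, round(len(tokens) * mask_ratio))
    masked_idx = set(rng.sample(range(len(tokens)), n_mask))
    masked = [mask_token if i in masked_idx else t for i, t in enumerate(tokens)]
    labels = {i: tokens[i] for i in sorted(masked_idx)}
    return masked, labels


# Hypothetical 18-base fragment, not a real barcode record.
tokens = kmer_tokenize("AACTTTATATTTTATTTT", k=4)
masked, labels = mask_tokens(tokens, mask_ratio=0.5)
print(tokens)
print(masked)
print(labels)
```

The choice of k and of the mask ratio changes both the vocabulary size and the difficulty of the prediction task, which is the kind of trade-off the abstract says the authors analyzed.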

Pablo Millan Arias, Niousha Sadjadi, Monireh Safari, ZeMing Gong, Austin T. Wang, Joakim Bruslund Haurum, Iuliia Zarubiieva, Dirk Steinke, Lila Kari, Angel X. Chang, Scott C. Lowe, Graham W. Taylor • 2023

Related benchmarks

Task                          Dataset                          Metric             Result  Rank
Species-level classification  DNA barcodes seen species        Accuracy           0.997   15
Genus-level 1-NN probe        DNA barcodes (unseen species)    Accuracy           78.5    9
Genus Classification          DNA Barcodes genus-level         Weighted F1        76.74   8
Species Classification        DNA Barcodes seen species split  Weighted F1        99.74   8
Species-level classification  INSECT                           Seen Accuracy      38.8    8
Taxonomic Classification      Yeast (test)                     Accuracy (Family)  95.4    7
Taxonomic Classification      Filamentous Fungi (test)         Accuracy (Family)  87.8    7
Taxonomic Classification      MycoAI (test)                    Accuracy (Family)  97.8    7
Species Classification        DNA Barcodes species-level       Weighted F1        99.34   6
BIN reconstruction            DNA barcodes                     Accuracy           79.9    6