In Search of Lost DNA Sequence Pretraining

About

DNA sequence encoding is fundamental to gene function prediction, protein synthesis, and diverse downstream biological tasks. Despite the substantial progress achieved by large-scale DNA sequence pretraining, existing studies have overwhelmingly emphasized pretraining scale and custom downstream evaluation datasets, while neglecting some essential components of the pretraining paradigm. In this paper, we reveal three critical yet heretofore overlooked problems in DNA pretraining: inappropriate downstream datasets, inherent flaws in the neighbor-masking strategy, and the lack of detailed discussion on vocabulary. Therefore, we undertake comprehensive investigations and propose principled guidelines, including selection criteria for evaluation datasets, guiding task design, and in-depth vocabulary analysis. Extensive experiments validate the significance of our identified problems and support the rationale behind our recommendations. Finally, we introduce a standardized testbed that enables reproducible and rigorous benchmarking of DNA pretraining methods to advance the development of genomic foundation models.

Zhijiang Tang, Jiaxin Qi, Yan Cui, Jinli Ou, Yuhua Zheng, Jianqiang Huang • 2026

Related benchmarks

Task                                Dataset                                 Result          Rank
Chromatin accessibility prediction  BEND CA                                 AUROC 71.28     3
CpG methylation prediction          BEND CpG                                AUROC 89.6      3
Histone modification prediction     BEND HM                                 AUROC 76.67     3
Regulatory annotation               Nucleotide Transformer Benchmark NE     AUROC 82.88     3
Regulatory annotation               Nucleotide Transformer Benchmark PA     AUROC 94.25     3
Regulatory annotation               Nucleotide Transformer Benchmark (PNT)  AUROC 95.14     3
Regulatory annotation               Genomic Benchmark EC                    Accuracy 71.45  3
