
Diffusion Language Models Are Versatile Protein Learners

About

This paper introduces the diffusion protein language model (DPLM), a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences. We first pre-train scalable DPLMs from evolutionary-scale protein sequences within a generative self-supervised discrete diffusion probabilistic framework, which generalizes language modeling for proteins in a principled way. After pre-training, DPLM can generate structurally plausible, novel, and diverse protein sequences in the unconditional setting. We further demonstrate that the proposed diffusion generative pre-training gives DPLM a better understanding of proteins, making it a superior representation learner that can be fine-tuned for various predictive tasks and compares favorably to ESM2 (Lin et al., 2022). Moreover, DPLM can be tailored to various needs, which showcases its prowess in conditional generation in several ways: (1) conditioning on partial peptide sequences, e.g., generating scaffolds for functional motifs with a high success rate; (2) incorporating other modalities as conditioners, e.g., structure-conditioned generation for inverse folding; and (3) steering sequence generation towards desired properties, e.g., satisfying specified secondary structures, through plug-and-play classifier guidance. Code is released at https://github.com/bytedance/dplm.
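
As a rough illustration of what a discrete diffusion framework over protein sequences can look like, the sketch below uses an absorbing-state (masking) process with a linear schedule. This is an assumption made for illustration: the abstract does not specify the noise process, and the names `corrupt`, `generate`, `random_predictor`, and the `#` mask symbol are hypothetical and not taken from the DPLM codebase. A random predictor stands in for the trained denoising network.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
MASK = "#"  # hypothetical absorbing "mask" symbol


def corrupt(sequence, t, num_steps, rng):
    """Forward process: mask each residue independently with probability t / num_steps."""
    mask_prob = t / num_steps
    return "".join(MASK if rng.random() < mask_prob else aa for aa in sequence)


def random_predictor(seq, i, rng):
    """Placeholder for a trained denoiser: picks a residue uniformly at random."""
    return rng.choice(AMINO_ACIDS)


def generate(length, num_steps, predict, rng):
    """Reverse process: start fully masked and unmask step by step.

    Under the linear schedule above, a position still masked at step t
    is revealed with probability 1 / t, so nothing stays masked after t = 1.
    """
    seq = [MASK] * length
    for t in range(num_steps, 0, -1):
        for i, token in enumerate(seq):
            if token == MASK and rng.random() < 1.0 / t:
                seq[i] = predict(seq, i, rng)
    return "".join(seq)


if __name__ == "__main__":
    rng = random.Random(0)
    print(corrupt("MKTAYIAKQR", t=5, num_steps=10, rng=rng))
    print(generate(length=20, num_steps=10, predict=random_predictor, rng=rng))
```

In a real model, `predict` would be a network conditioned on the partially masked sequence. Conditioning on a partial peptide sequence could then be sketched by starting `generate` from a sequence in which the motif positions are already filled in rather than masked.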

Xinyou Wang, Zaixiang Zheng, Fei Ye, Dongyu Xue, Shujian Huang, Quanquan Gu • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Localization | DL Bin PFMBench (test) | Score 0.9331 | 11 |
| Localization | DL Multi PFMBench (test) | Score 0.7803 | 11 |
| Interaction | M. I. Bin. PFMBench (test) | Score 70.056 | 10 |
| Protein Sequence Generation | UniRef50 | pLDDT 80.23 | 9 |
| Interaction | BindingDB | Score 0.1741 | 8 |
| Unconditional DNA Sequence Generation | EPD-GenDNA Sequence Length = 2048 revised NT downstream dataset (test) | DIV 1.53e+3 | 8 |
| Unconditional DNA Sequence Generation | EPD-GenDNA NT downstream Sequence Length = 256 revised (test) | DIV 191 | 8 |
| Protein Sequence Generation | Protein Sequence Generation | pLDDT 80.23 | 6 |
| Protein Sequence Generation | UniRef50 (test) | pLDDT 80.23 | 5 |