DPLM-2: A Multimodal Diffusion Protein Language Model

About

Proteins are essential macromolecules defined by their amino acid sequences, which determine their three-dimensional structures and, consequently, their functions in all living organisms. Therefore, generative protein modeling necessitates a multimodal approach to simultaneously model, understand, and generate both sequences and structures. However, existing methods typically use separate models for each modality, limiting their ability to capture the intricate relationships between sequence and structure. This results in suboptimal performance in tasks that require joint understanding and generation of both modalities. In this paper, we introduce DPLM-2, a multimodal protein foundation model that extends the discrete diffusion protein language model (DPLM) to accommodate both sequences and structures. To enable structural learning with the language model, 3D coordinates are converted to discrete tokens using a lookup-free quantization-based tokenizer. By training on both experimental and high-quality synthetic structures, DPLM-2 learns the joint distribution of sequence and structure, as well as their marginals and conditionals. We also implement an efficient warm-up strategy to exploit the connection between large-scale evolutionary data and structural inductive biases from pre-trained sequence-based protein language models. Empirical evaluation shows that DPLM-2 can simultaneously generate highly compatible amino acid sequences and their corresponding 3D structures, eliminating the need for a two-stage generation approach. Moreover, DPLM-2 demonstrates competitive performance in various conditional generation tasks, including folding, inverse folding, and scaffolding with multimodal motif inputs, and provides structure-aware representations for predictive tasks.
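
The abstract mentions a lookup-free quantization-based tokenizer but does not describe it. As a rough illustration only, the sketch below shows the general idea of lookup-free quantization in its common sign-based form: each latent dimension is binarized to -1 or +1, and the resulting bit pattern is read directly as a token index, so no codebook lookup is needed. The function names, the three-dimensional latent, and the bit ordering are assumptions made for this example; the actual DPLM-2 tokenizer (its encoder, latent size, and training losses) is not specified on this page and may differ.

```python
from typing import List, Tuple


def lfq_quantize(latent: List[float]) -> Tuple[List[int], int]:
    """Binarize a latent vector by sign and derive its token index.

    Hypothetical helper: dimension i contributes bit i (least significant
    first) when its value is positive.
    """
    codes = [1 if x > 0 else -1 for x in latent]
    index = sum(1 << i for i, c in enumerate(codes) if c > 0)
    return codes, index


def lfq_decode(index: int, dim: int) -> List[int]:
    """Recover the -1/+1 code vector from a token index without a codebook."""
    return [1 if (index >> i) & 1 else -1 for i in range(dim)]


if __name__ == "__main__":
    # A made-up 3-dimensional latent for one residue's structural feature.
    codes, token = lfq_quantize([0.3, -1.2, 0.7])
    print(codes, token)
    print(lfq_decode(token, dim=3))
```

With a latent of dimension d, this scheme yields an implicit vocabulary of 2^d structure tokens, which can then be handled by a language model alongside amino acid tokens.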

Xinyou Wang, Zaixiang Zheng, Fei Ye, Dongyu Xue, Shujian Huang, Quanquan Gu • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Lead Optimization in Protein Folding | CAMEO 183 targets 2022 | RMSD (Base) | 10.5 | 8 |
| Lead Optimization in Protein Folding | PDB date-split (449 targets) | RMSD (Base) | 8.45 | 8 |
| Protein Generation | Protein Structure Generation (test) | Designability | 0.486 | 8 |
| Protein Reconstruction | CAMEO (512 samples) | RMSD | 1.65 | 8 |
| Protein Reconstruction | CATH 512 samples | RMSD | 1.64 | 8 |
| Protein Reconstruction | AFDB 512 samples | RMSD | 4.68 | 8 |
| Inverse folding | CAMEO benchmark 2022 | AAR | 45.35 | 7 |
| Inverse folding | PDB (date-split) | AAR | 51.3 | 7 |
| Forward folding | CAMEO subset (n=163) | TM-score | 0.724 | 4 |
| Motif Scaffolding | 24 curated motifs (summary) | Mean RMSD | 2.971 | 4 |
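
The inverse folding rows report AAR, which in this setting conventionally stands for amino acid recovery: the percentage of positions where the designed sequence matches the native one. The sketch below shows that calculation under the assumption that the two sequences are already aligned and of equal length; the function name and the toy sequences are made up for illustration, and the benchmark's exact evaluation protocol is not given on this page.

```python
def amino_acid_recovery(native: str, designed: str) -> float:
    """Percentage of positions where the designed residue equals the native one.

    Assumes both sequences are position-aligned and of equal, non-zero length.
    """
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(1 for a, b in zip(native, designed) if a == b)
    return 100.0 * matches / len(native)


if __name__ == "__main__":
    # Toy example: 3 of 4 residues are recovered.
    print(amino_acid_recovery("ACDE", "ACDF"))
```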