ProteinGPT: Multimodal LLM for Protein Property Prediction and Structure Understanding

About

Understanding biological processes, drug development, and biotechnological advancements requires a detailed analysis of protein structures and functions, a task that is inherently complex and time-consuming in traditional protein research. To streamline this process, we introduce ProteinGPT, a state-of-the-art multimodal large language model for proteins that enables users to upload protein sequences and/or structures for comprehensive analysis and responsive inquiries. ProteinGPT integrates protein sequence and structure encoders with linear projection layers to ensure precise representation adaptation and leverages a large language model (LLM) to generate accurate, contextually relevant responses. To train ProteinGPT, we constructed a large-scale dataset of 132,092 proteins, each annotated with 20-30 property tags and 5-10 QA pairs per protein, and optimized the instruction-tuning process using GPT-4o. Experiments demonstrate that ProteinGPT effectively generates informative responses to protein-related questions, achieving high performance on both semantic and lexical metrics and significantly outperforming baseline models and general-purpose LLMs in understanding and responding to protein-related queries. Our code and data are available at https://github.com/ProteinGPT/ProteinGPT.
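The adaptation step described above (encoder outputs passed through linear projection layers before reaching the LLM) can be illustrated with a minimal sketch. This is not the authors' implementation: the dimensions, the random weights, the helper names (`make_linear`, `project`, `build_llm_input`), and the choice to place the projected protein tokens in front of the text embeddings are all assumptions made for illustration.

```python
import random

# Toy dimensions, chosen only for illustration (hypothetical, not from the paper).
SEQ_DIM, STRUCT_DIM, LLM_DIM = 4, 3, 5


def make_linear(in_dim, out_dim, rng):
    """Create a (weight, bias) pair for a linear layer with small random weights."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def project(features, layer):
    """Apply y = W x + b to every feature vector in `features`."""
    weight, bias = layer
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]
        for vec in features
    ]


def build_llm_input(seq_feats, struct_feats, text_embeds, seq_layer, struct_layer):
    """Map both modalities into the LLM embedding width and concatenate them
    with the text embeddings (the ordering here is an assumption)."""
    return project(seq_feats, seq_layer) + project(struct_feats, struct_layer) + text_embeds


rng = random.Random(0)
seq_layer = make_linear(SEQ_DIM, LLM_DIM, rng)
struct_layer = make_linear(STRUCT_DIM, LLM_DIM, rng)

# Stand-ins for encoder outputs and prompt embeddings.
seq_feats = [[rng.random() for _ in range(SEQ_DIM)] for _ in range(6)]
struct_feats = [[rng.random() for _ in range(STRUCT_DIM)] for _ in range(6)]
text_embeds = [[rng.random() for _ in range(LLM_DIM)] for _ in range(2)]

llm_input = build_llm_input(seq_feats, struct_feats, text_embeds, seq_layer, struct_layer)
```

In this sketch every row of `llm_input` has the LLM embedding width, so the language model can consume the protein features and the text prompt as a single sequence.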

Yijia Xiao, Edward Sun, Yiqiao Jin, Qifan Wang, Wei Wang • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Domain or Motif Prediction | Mol-Instructions Protein-oriented | ROUGE-L | 0.472 | 11 |
| Catalytic Activity Prediction | Mol-Instructions Protein-oriented | ROUGE-L | 40.6 | 11 |
| Interaction Extraction | Mol-Instructions Protein-oriented | F1 Score | 0.166 | 11 |
| Functional Description Prediction | Mol-Instructions Protein-oriented | ROUGE-L | 0.416 | 11 |
| Protein Function Prediction | Mol-Instructions Protein-oriented | ROUGE-L | 33.6 | 11 |
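
For readers unfamiliar with the ROUGE-L metric used in most rows above, the following is a minimal sketch of how such a score can be computed from the longest common subsequence (LCS) of reference and candidate tokens. It assumes lowercase whitespace tokenization and the balanced F-measure; the benchmark's actual scoring script may tokenize or weight precision and recall differently. The ROUGE-L results listed above appear to mix a 0 to 1 scale with a 0 to 100 scale; this sketch returns values in [0, 1].

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between two strings (whitespace tokens, case-insensitive)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1(
    "this protein catalyzes the hydrolysis of ATP",
    "the protein catalyzes ATP hydrolysis",
)
```

Because the LCS respects word order, a candidate that reuses the right words in a different order scores lower than one that preserves the reference phrasing.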