
BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning

About

Recent research in computational biology has increasingly focused on integrating text and bio-entity modeling, especially for molecules and proteins. However, previous efforts such as BioT5 struggled to generalize across diverse tasks and lacked a nuanced understanding of molecular structures, particularly in their textual representations (e.g., IUPAC names). This paper introduces BioT5+, an extension of the BioT5 framework tailored to biological research and drug discovery. BioT5+ incorporates several novel features: integration of IUPAC names for molecular understanding, inclusion of extensive bio-text and molecule data from sources such as bioRxiv and PubChem, multi-task instruction tuning for generality across tasks, and a numerical tokenization technique for improved processing of numerical data. These enhancements allow BioT5+ to bridge the gap between molecular representations and their textual descriptions, providing a more holistic understanding of biological entities and substantially improving grounded reasoning over bio-text and bio-sequences. The model is pre-trained and fine-tuned across a large number of experiments covering 3 types of problems (classification, regression, generation), 15 kinds of tasks, and 21 benchmark datasets in total, demonstrating remarkable performance and state-of-the-art results in most cases. BioT5+ stands out for its ability to capture intricate relationships in biological data, thereby contributing significantly to bioinformatics and computational biology. Our code is available at https://github.com/QizhiPei/BioT5.

Qizhi Pei, Lijun Wu, Kaiyuan Gao, Xiaozhuan Liang, Yin Fang, Jinhua Zhu, Shufang Xie, Tao Qin, Rui Yan • 2024
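
The abstract mentions a numerical tokenization technique but does not spell out the scheme. As a rough illustration only, the sketch below assumes a character-level approach in which every numeric literal is split into single-character tokens, so that a value such as -2.35 is never merged into one opaque subword. The helper name `split_numbers` and the regular expression are hypothetical and are not taken from the BioT5+ codebase.

```python
import re
from typing import List

# Hypothetical pattern for a plain decimal literal with an optional sign.
# This is an illustrative assumption, not the tokenizer used by BioT5+.
_NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def split_numbers(text: str) -> List[str]:
    """Whitespace-tokenize text, splitting each numeric literal into characters."""
    tokens: List[str] = []
    for word in text.split():
        if _NUMBER.fullmatch(word):
            # A numeric literal becomes one token per character (sign, digits, dot).
            tokens.extend(word)
        else:
            # Everything else is kept as a single whitespace-delimited token.
            tokens.append(word)
    return tokens


print(split_numbers("logP is -2.35"))
# prints ['logP', 'is', '-', '2', '.', '3', '5']
```

Words that merely contain digits, such as a molecular formula like C6H6, are left intact under this assumption because they do not match the numeric pattern as a whole.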

Related benchmarks

Task | Dataset | Metric | Result | Rank
Molecule Captioning | ChEBI-20 (test) | BLEU-4 | 0.591 | 107
Molecular property prediction | BACE (test) | ROC-AUC | 86.2 | 65
Text-guided molecule generation | ChEBI-20 (test) | MACCS FTS Similarity | 90.7 | 48
Molecular Property Classification | MoleculeNet BBBP | ROC AUC | 65.1 | 41
Molecular Property Classification | MoleculeNet BACE | ROC AUC | 81.1 | 36
Classification | MoleculeNet BBBP (test) | ROC AUC | 0.765 | 30
Molecular Property Classification | MoleculeNet ClinTox | ROC-AUC | 83.7 | 27
Description-guided molecule design | ChEBI-20 2022 (test) | Exact Match Accuracy | 52.2 | 26
Reagent Prediction | Mol-Instructions | Exact Match | 25.7 | 24
Retrosynthesis | Mol-Instructions | Exact Match | 64.2 | 24
Showing 10 of 40 rows
