BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning
About
Recent research trends in computational biology have increasingly focused on integrating text and bio-entity modeling, especially in the context of molecules and proteins. However, previous efforts like BioT5 faced challenges in generalizing across diverse tasks and lacked a nuanced understanding of molecular structures, particularly in their textual representations (e.g., IUPAC). This paper introduces BioT5+, an extension of the BioT5 framework, tailored to enhance biological research and drug discovery. BioT5+ incorporates several novel features: integration of IUPAC names for molecular understanding, inclusion of extensive bio-text and molecule data from sources like bioRxiv and PubChem, multi-task instruction tuning for generality across tasks, and a numerical tokenization technique for improved processing of numerical data. These enhancements allow BioT5+ to bridge the gap between molecular representations and their textual descriptions, providing a more holistic understanding of biological entities and substantially improving grounded reasoning over bio-text and bio-sequences. The model is pre-trained and then fine-tuned across a large number of experiments, covering 3 types of problems (classification, regression, generation), 15 kinds of tasks, and 21 benchmark datasets in total, and achieves state-of-the-art results in most cases. BioT5+ stands out for its ability to capture intricate relationships in biological data, thereby contributing significantly to bioinformatics and computational biology. Our code is available at https://github.com/QizhiPei/BioT5.
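The abstract names a numerical tokenization technique without describing it. As a minimal sketch, assuming the common approach of splitting each number into single-character tokens so that digits are handled consistently, the idea could look like the following. The function `split_numbers` and the whitespace pre-splitting are hypothetical illustrations, not taken from the BioT5+ codebase.

```python
import re
from typing import List

# Matches a standalone integer or decimal with an optional sign (assumed format).
NUMBER = re.compile(r"[-+]?\d+(\.\d+)?")

def split_numbers(text: str) -> List[str]:
    """Split on whitespace, then break each number into one token per character."""
    tokens: List[str] = []
    for piece in text.split():
        if NUMBER.fullmatch(piece):
            tokens.extend(piece)   # a string extends the list character by character
        else:
            tokens.append(piece)   # non-numeric pieces are kept whole
    return tokens

print(split_numbers("logP is -3.201"))
# ['logP', 'is', '-', '3', '.', '2', '0', '1']
```

In a real model, the non-numeric pieces would go through a subword tokenizer rather than being kept whole; only the per-character handling of numbers is the point of the sketch.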
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Molecule Captioning | ChEBI-20 (test) | BLEU-4 | 0.591 | 107 |
| Molecular property prediction | BACE (test) | ROC-AUC | 86.2 | 65 |
| Text-guided molecule generation | ChEBI-20 (test) | MACCS FTS Similarity | 90.7 | 48 |
| Molecular Property Classification | MoleculeNet BBBP | ROC AUC | 65.1 | 41 |
| Molecular Property Classification | MoleculeNet BACE | ROC AUC | 81.1 | 36 |
| Classification | MoleculeNet BBBP (test) | ROC AUC | 0.765 | 30 |
| Molecular Property Classification | MoleculeNet ClinTox | ROC-AUC | 83.7 | 27 |
| Description-guided molecule design | ChEBI-20 2022 (test) | Exact Match Accuracy | 52.2 | 26 |
| Reagent Prediction | Mol-Instructions | Exact Match | 25.7 | 24 |
| Retrosynthesis | Mol-Instructions | Exact Match | 64.2 | 24 |