Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models

About

Large Language Models (LLMs), with their remarkable task-handling capabilities and innovative outputs, have catalyzed significant advancements across a spectrum of fields. However, their proficiency within specialized domains such as biomolecular studies remains limited. To address this challenge, we introduce Mol-Instructions, a comprehensive instruction dataset designed for the biomolecular domain. Mol-Instructions encompasses three key components: molecule-oriented instructions, protein-oriented instructions, and biomolecular text instructions. Each component aims to improve the understanding and prediction capabilities of LLMs concerning biomolecular features and behaviors. Through extensive instruction tuning experiments on LLMs, we demonstrate the effectiveness of Mol-Instructions in enhancing large models' performance in the intricate realm of biomolecular studies, thus fostering progress in the biomolecular research community. Mol-Instructions is publicly available for ongoing research and will undergo regular updates to enhance its applicability.

Yin Fang, Xiaozhuan Liang, Ningyu Zhang, Kangwei Liu, Rui Huang, Zhuo Chen, Xiaohui Fan, Huajun Chen • 2023

Related benchmarks

Task	Dataset	Result
Molecular property prediction	QM9 (test)	--	263
Molecule Captioning	ChEBI-20 (test)	METEOR 0.4341	114
Molecular Property Classification	MoleculeNet BBBP	ROC AUC 58	59
Text-guided molecule generation	ChEBI-20 (test)	MACCS FTS Similarity 41.2	48
Molecular Property Classification	MoleculeNet BACE	ROC AUC 41.7	47
Molecular Property Classification	MoleculeNet ClinTox	ROC AUC 47.8	39
Regression	MoleculeNet Lipophilicity (test)	RMSE 1.691	37
Molecule Description Generation	ChEBI-20 (test)	BLEU-2 0.249	34
molecule question answering task	MoleculeQA	Structure Score 37.46	31
Reagent Prediction	Mol-Instructions	Exact Match 4.4	30

Showing 10 of 71 rows
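
Two of the metrics in the table, Exact Match and RMSE, are simple enough to illustrate directly. The following is a minimal sketch, not the evaluation code behind the reported numbers: the function names and the sample data are hypothetical, and it assumes Exact Match is reported as a percentage of predictions that equal the reference string. Published evaluations of molecular strings may canonicalize them before comparison, which this sketch does not do.

```python
import math


def exact_match(predictions, references):
    """Percentage of predictions identical to their reference string.

    Assumption: plain string equality after trimming whitespace; no
    chemical canonicalization is applied.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("at least one example is required")
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


def rmse(predictions, targets):
    """Root mean squared error between predicted and target values."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have equal length")
    if not targets:
        raise ValueError("at least one example is required")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


if __name__ == "__main__":
    # Hypothetical model outputs, for illustration only.
    predicted_reagents = ["CCO", "CC"]
    reference_reagents = ["CCO", "CN"]
    print(exact_match(predicted_reagents, reference_reagents))

    predicted_values = [1.0, 2.0]
    measured_values = [1.0, 4.0]
    print(rmse(predicted_values, measured_values))
```

Lower is better for RMSE and higher is better for Exact Match, so the two columns are not comparable across rows.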
