Translation between Molecules and Natural Language

About

We present MolT5, a self-supervised learning framework for pretraining models on a vast amount of unlabeled natural-language text and molecule strings. MolT5 allows for new, useful, and challenging analogs of traditional vision-language tasks, such as molecule captioning and text-based de novo molecule generation (together: translation between molecules and language), which we explore for the first time. Since MolT5 pretrains models on single-modal data, it helps overcome the data scarcity of the chemistry domain. Furthermore, we consider several metrics, including a new cross-modal embedding-based metric, to evaluate molecule captioning and text-based molecule generation. Our results show that MolT5-based models are able to generate outputs, both molecules and captions, that are in many cases high quality.

Carl Edwards, Tuan Lai, Kevin Ros, Garrett Honke, Kyunghyun Cho, Heng Ji • 2022
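
As a quick illustration of the molecule-captioning direction, here is a minimal sketch that runs one of the authors' finetuned MolT5 checkpoints through Hugging Face transformers. The checkpoint id laituan245/molt5-base-smiles2caption is an assumption based on the released model variants, not quoted from this page; swap in whichever checkpoint you actually use.

```python
# Minimal molecule-captioning sketch with a finetuned MolT5 checkpoint.
# The checkpoint id below is an assumption (one of the released variants);
# substitute the MolT5 model you actually want to run.
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_id = "laituan245/molt5-base-smiles2caption"  # assumed Hub id
tokenizer = T5Tokenizer.from_pretrained(model_id, model_max_length=512)
model = T5ForConditionalGeneration.from_pretrained(model_id)

# Input is a SMILES string; the model generates a natural-language caption.
smiles = "C1=CC2=C(C(=C1)[O-])NC(=CC2=O)C(=O)O"
input_ids = tokenizer(smiles, return_tensors="pt").input_ids

outputs = model.generate(input_ids, num_beams=5, max_length=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The caption2smiles checkpoints run the same way in the opposite direction: the input is a text description and the generated output is a SMILES string.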

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Molecule Captioning | ChEBI-20 (test) | METEOR | 0.614 | 114 |
| Text-guided molecule generation | ChEBI-20 (test) | MACCS FTS Similarity | 83.4 | 48 |
| Molecule Description Generation | ChEBI-20 (test) | BLEU-2 | 0.54 | 34 |
| Quantitative Solute-Solvent Interaction | FreeSolv (test) | RMSE | 1.135 | 29 |
| Description-guided molecule design | ChEBI-20 2022 (test) | Exact Match Accuracy | 31.1 | 26 |
| Regression | MoleculeNet Lipophilicity | RMSE | 0.65 | 21 |
| Molecule Description Generation | ChEBI-20 2022 (test) | BLEU-2 | 0.54 | 20 |
| Molecule Captioning | Mol-Instructions | ROUGE-L | 0.594 | 17 |
| Regression | MoleculeNet FreeSolv | RMSE | 1.55 | 16 |
| Forward reaction prediction | Mol-Instructions (test) | EM Score | 0.897 | 15 |
Showing 10 of 72 rows.
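
For reference, the MACCS FTS metric in the table above is the Tanimoto similarity between MACCS structural-key fingerprints of a generated molecule and its ground truth, averaged over the test set. Below is a minimal sketch using the standard RDKit recipe; it assumes the common convention of scoring unparseable SMILES as zero and may differ in details from the benchmark's exact evaluation script.

```python
# Sketch of MACCS fingerprint Tanimoto similarity (MACCS FTS) with RDKit.
from rdkit import Chem, DataStructs
from rdkit.Chem import MACCSkeys

def maccs_fts(smiles_pred: str, smiles_true: str) -> float:
    """Tanimoto similarity between the MACCS fingerprints of two molecules."""
    mol_pred = Chem.MolFromSmiles(smiles_pred)
    mol_true = Chem.MolFromSmiles(smiles_true)
    if mol_pred is None or mol_true is None:
        return 0.0  # assumed convention: unparseable SMILES scores zero
    fp_pred = MACCSkeys.GenMACCSKeys(mol_pred)
    fp_true = MACCSkeys.GenMACCSKeys(mol_true)
    return DataStructs.TanimotoSimilarity(fp_pred, fp_true)

# Example: ethanol vs. acetic acid share only some substructure keys.
print(maccs_fts("CCO", "CC(=O)O"))
```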

Other info

Code

https://github.com/blender-nlp/MolT5
