Large-Scale Chemical Language Representations Capture Molecular Structure and Properties

About

Models based on machine learning can enable accurate and fast molecular property predictions, which is of interest in drug discovery and material design. Various supervised machine learning models have demonstrated promising performance, but the vast chemical space and the limited availability of property labels make supervised learning challenging. Recently, unsupervised transformer-based language models pretrained on a large unlabelled corpus have produced state-of-the-art results in many downstream natural language processing tasks. Inspired by this development, we present molecular embeddings obtained by training an efficient transformer encoder model, MoLFormer, which uses rotary positional embeddings. This model employs a linear attention mechanism, coupled with highly distributed training, on SMILES sequences of 1.1 billion unlabelled molecules from the PubChem and ZINC datasets. We show that the learned molecular representation outperforms existing baselines, including supervised and self-supervised graph neural networks and language models, on several downstream tasks from ten benchmark datasets, and performs competitively on two others. Further analyses, specifically through the lens of attention, demonstrate that MoLFormer trained on chemical SMILES indeed learns the spatial relationships between atoms within a molecule. These results provide encouraging evidence that large-scale molecular language models can capture sufficient chemical and structural information to predict various distinct molecular properties, including quantum-chemical properties.
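
To make "SMILES sequences" concrete, the sketch below shows one way a SMILES string could be split into tokens before being fed to a language model. The abstract does not say which tokenizer MoLFormer uses, so the regular expression and the function name `tokenize_smiles` are illustrative assumptions, not the authors' implementation. The pattern covers only common SMILES syntax (bracket atoms, two-letter halogens, organic-subset atoms, bonds, branches, and ring closures).

```python
import re
from typing import List

# Hypothetical, simplified SMILES token pattern (an assumption for
# illustration; not necessarily the tokenizer used to train MoLFormer).
# Alternatives are tried left to right, so "Cl" and "Br" are matched
# before the single-letter atoms "C" and "B".
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"               # bracket atoms such as [Na+] or [C@@H]
    r"|Br|Cl"                   # two-letter halogens
    r"|[BCNOSPFI]"              # aliphatic organic-subset atoms
    r"|[bcnosp]"                # aromatic atoms
    r"|[()\.=#\-+\\/:~@?>*$]"   # branches, bonds, and other symbols
    r"|%[0-9]{2}"               # two-digit ring closures
    r"|[0-9]"                   # single-digit ring closures
)


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into tokens; raise if any character is unmatched."""
    tokens = SMILES_TOKEN.findall(smiles)
    # findall silently skips unmatched characters, so check the round trip.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognised characters in SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
    print(tokenize_smiles("[Na+].[Cl-]"))            # sodium chloride
```

In a full pipeline the resulting tokens would be mapped to integer ids from a vocabulary before being passed to the encoder; that step is omitted here.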

Jerret Ross, Brian Belgodere, Vijil Chenthamarakshan, Inkit Padhi, Youssef Mroueh, Payel Das • 2021

Related benchmarks

Task | Dataset | Result | Rank
Molecular Property Classification | MoleculeNet BBBP | ROC-AUC 73.6 | 41
Molecular Property Classification | MoleculeNet BACE | ROC-AUC 86.3 | 36
Molecular Property Classification | MoleculeNet ClinTox | ROC-AUC 91.2 | 27
Molecular Property Classification | MoleculeNet SIDER | ROC-AUC 0.655 | 21
Regression | MoleculeNet LIPO | RMSE 0.7 | 19
Regression | MoleculeNet BACE (test) | RMSE 1.047 | 14
Peptide-Protein Interaction Prediction | Propedia PPI aCSM (cluster-based) | ROC-AUC 0.7516 | 14
Peptide-Protein Interaction Prediction | Propedia PPI (Random split) | ROC-AUC 92.18 | 14
Regression | MoleculeNet Delaney ESOL | RMSE 0.88 | 10
Membrane Permeability Prediction | CycPeptMPDB | R² 0.5776 | 9
Showing 10 of 16 rows
