
FARM: Enhancing Molecular Representations with Functional Group Awareness

About

We introduce Functional Group-Aware Representations for Small Molecules (FARM), a novel foundation model designed to bridge the gap between SMILES, natural language, and molecular graphs. The key idea behind FARM is the incorporation of functional group (FG) annotations at the atomic level, enabling both FG-enhanced SMILES and FG graphs. In this representation, SMILES strings are enriched with functional group information that identifies the group membership of each atom, while the FG graph captures molecular structure by representing how functional groups are connected. This tokenization injects chemical knowledge into SMILES and expands the effective molecular vocabulary, making the representation more suitable for Transformer-based models and more aligned with natural language structure. FARM learns molecular representations from two complementary perspectives to jointly encode functional and structural information. Masked language modeling on FG-enhanced SMILES captures atom-level features enriched with functional context, while graph neural networks model higher-level molecular topology through functional group connectivity. Contrastive learning is then used to align these two views into a unified embedding space, ensuring that both atom-level detail and functional group structure are jointly represented. We evaluate FARM on the MoleculeNet benchmark and achieve state-of-the-art performance on 8 out of 13 tasks. We further validate its generalization ability on a photostability dataset for quantum mechanical properties. These results demonstrate that FARM improves molecular representation learning, supports strong transfer learning across drug discovery and materials science, and enables broad applications in pharmaceutical research and functional material design.
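The idea of enriching SMILES with per-atom functional group labels can be illustrated with a toy sketch. This is not FARM's tokenizer: the abstract does not specify the token format or how functional groups are detected, so the `atom_FG` token format, the simplified regular expression, the function names `tokenize` and `fg_enhance`, and the hand-written labels are all assumptions made for illustration. It shows only how attaching a group label to each atom token enlarges the vocabulary relative to plain SMILES.

```python
import re

# Simplified SMILES token pattern (an assumption, not a full SMILES grammar):
# bracket atoms, two-letter halogens, organic-subset atoms, ring-bond digits,
# then bond, branch, and charge symbols.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|%\d{2}|\d|[=#\-\+\(\)/\\\.@]"
)
# Subset of the pattern above that matches atom tokens only.
ATOM_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")


def tokenize(smiles):
    """Split a SMILES string into atom, bond, branch, and ring tokens."""
    return SMILES_TOKEN.findall(smiles)


def fg_enhance(smiles, atom_groups):
    """Attach a functional-group label to every atom token.

    atom_groups holds one label per atom, in SMILES order. The labels are
    supplied by hand here; a real pipeline would derive them with a
    cheminformatics toolkit.
    """
    tokens = tokenize(smiles)
    n_atoms = sum(1 for tok in tokens if ATOM_TOKEN.fullmatch(tok))
    if n_atoms != len(atom_groups):
        raise ValueError(f"expected {n_atoms} labels, got {len(atom_groups)}")
    labels = iter(atom_groups)
    return [
        f"{tok}_{next(labels)}" if ATOM_TOKEN.fullmatch(tok) else tok
        for tok in tokens
    ]


# Acetic acid and ethanol with hand-assigned (hypothetical) group labels.
acetic_acid = fg_enhance("CC(=O)O", ["alkyl", "carboxyl", "carboxyl", "carboxyl"])
ethanol = fg_enhance("CCO", ["alkyl", "alkyl", "hydroxyl"])

plain_vocab = set(tokenize("CC(=O)O")) | set(tokenize("CCO"))
enhanced_vocab = set(acetic_acid) | set(ethanol)

print(acetic_acid)
print(ethanol)
print(len(plain_vocab), len(enhanced_vocab))
```

In plain SMILES the oxygen of ethanol and the oxygens of acetic acid share the single token `O`; in the enhanced form they become distinct tokens, which is the sense in which the vocabulary grows.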

Thao Nguyen, Kuan-Hao Huang, Ge Liu, Martin D. Burke, Ying Diao, Heng Ji • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Regression | MoleculeNet (scaffold) | Lipo 0.778 | 36 |
| Classification | MoleculeNet | BBBP Accuracy 93.3 | 20 |
| Molecular property prediction | MoleculeNet Regression | QM8 MAE 0.0146 | 16 |
| ADMET Properties Prediction | TDC AMES | AUROC 0.875 | 12 |
| Drug absorption property prediction | Aqsol | MAE 0.739 | 7 |
| Drug absorption property prediction | Bioav | ROC-AUC 0.709 | 7 |
| Distribution | TDC PPBR | MAE 7.376 | 2 |
| Distribution | TDC VDss | Spearman Correlation 0.652 | 2 |
| Metabolism | TDC CYP2C9 Inhibition | AUPRC 79.8 | 2 |
| Metabolism | TDC CYP3A4 Inhibition | AUPRC 87.7 | 2 |
Showing 10 of 11 rows
