Learning Invariant Molecular Representation in Latent Discrete Space
About
Molecular representation learning lays the foundation for drug discovery. However, existing methods suffer from poor out-of-distribution (OOD) generalization, particularly when training and test data originate from different environments. To address this issue, we propose a new framework for learning molecular representations that are invariant and robust under distribution shifts. Specifically, we propose a strategy called "first-encoding-then-separation" to identify invariant molecular features in the latent space, which departs from the conventional practice. Prior to the separation step, we introduce a residual vector quantization module that mitigates over-fitting to the training data distribution while preserving the expressivity of the encoder. Furthermore, we design a task-agnostic self-supervised learning objective to encourage precise invariance identification, which makes our method applicable to a wide variety of tasks, such as regression and multi-label classification. Extensive experiments on 18 real-world molecular datasets demonstrate that our model generalizes better than state-of-the-art baselines under various distribution shifts. Our code is available at https://github.com/HICAI-ZJU/iMoLD.
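To make the residual vector quantization idea concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes one plausible reading of the abstract: each continuous embedding is snapped to its nearest codebook vector, and the discrete vector is then added back to the continuous one so that encoder expressivity is retained. The function names `quantize` and `residual_quantize`, the codebook size, and the embedding dimension are all hypothetical choices made for illustration.

```python
import numpy as np


def quantize(z, codebook):
    """Map each row of z to its nearest codebook vector (Euclidean distance)."""
    # Pairwise squared distances, shape (num_embeddings, num_codes).
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return codebook[idx], idx


def residual_quantize(z, codebook):
    """Assumed residual form: continuous embedding plus its quantized version."""
    q, idx = quantize(z, codebook)
    return z + q, idx


# Toy data: 5 node embeddings of dimension 4 and a codebook with 8 entries.
rng = np.random.default_rng(0)
z = rng.normal(size=(5, 4))
codebook = rng.normal(size=(8, 4))

out, idx = residual_quantize(z, codebook)
```

In a trainable model the codebook would be learned and gradients would need to pass through the non-differentiable `argmin` (for example with a straight-through estimator); both are omitted here to keep the sketch short.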
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Graph Classification | DrugOOD EC50 (OOD test) | ROC-AUC | 77.48 | 52 |
| Graph Classification | MolHiv GOOD (size) | ROC-AUC | 62.2 | 28 |
| Graph Classification | MolHiv GOOD (scaffold) | ROC-AUC | 69.06 | 28 |
| Graph Classification | GOOD-SST2 length | Accuracy | 79.41 | 28 |
| Graph Classification | HIV GraphOOD (test) | ROC-AUC | 72.93 | 26 |
| Regression | GOOD-ZINC size split, covariate shift | Mean Absolute Error | 0.1029 | 24 |
| Graph Classification | DrugOOD IC50 (test) | ROC-AUC | 72.11 | 24 |
| Binary Classification | GOOD-HIV scaffold split, covariate shift | ROC-AUC | 72.93 | 15 |
| Binary Classification | GOOD-HIV scaffold split, concept shift | ROC-AUC | 0.7432 | 15 |
| Binary Classification | GOOD-HIV size split, covariate shift | ROC-AUC | 0.6286 | 15 |