MolReFlect: Towards In-Context Fine-grained Alignments between Molecules and Texts
About
Molecule discovery is a pivotal research field, impacting everything from medicine to materials. Recently, Large Language Models (LLMs) have been widely adopted in molecular understanding and generation, serving as a bridge between the molecular space and the natural language space, yet the alignment between molecules and their corresponding captions remains a significant challenge. Previous approaches typically treat molecules as monolithic inputs, lacking an intermediate reasoning process and sacrificing explainability. In this work, we define fine-grained alignments as the precise correspondence between a molecule's sub-structures and the textual phrases that explain their properties. These alignments are crucial for LLMs to understand molecules in a more accurate and explainable manner. Normally, such fine-grained alignments require expert annotation, which is both costly and time-consuming. To allow LLMs to automatically label and learn the fine-grained alignments, we propose MolReFlect, a novel teacher-student framework in which a teacher LLM first generates and refines mappings between caption phrases and SMILES substructures and then explicitly teaches these detailed alignments to a student LLM. Experimental results demonstrate that MolReFlect enables LLMs to significantly outperform previous baselines, achieving state-of-the-art performance in the molecule-caption translation task. Our code is available at: https://github.com/phenixace/MolReFlect.
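To make the notion of a fine-grained alignment concrete, the following is a minimal illustrative sketch, not the authors' implementation. It represents alignments as pairs of a SMILES substructure and a caption phrase, and formats them into a prompt that a student model could condition on. The `Alignment` class, the `build_student_prompt` function, the prompt layout, and the hand-written example alignments for benzoic acid are all assumptions made for illustration; in the framework described above, a teacher LLM would produce and refine such mappings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """One hypothetical fine-grained alignment: a substructure and its phrase."""
    substructure: str  # SMILES fragment of the molecule
    phrase: str        # caption phrase describing that fragment


def build_student_prompt(smiles: str, alignments: list[Alignment]) -> str:
    """Format a molecule and its alignments into a plain-text prompt.

    The layout is an assumption for illustration, not a documented format.
    """
    lines = [f"Molecule SMILES: {smiles}", "Fine-grained alignments:"]
    for a in alignments:
        lines.append(f"- {a.substructure}: {a.phrase}")
    lines.append("Caption:")
    return "\n".join(lines)


# Hand-written example alignments for benzoic acid (illustrative only).
example = [
    Alignment("C(=O)O", "carboxylic acid group"),
    Alignment("c1ccccc1", "benzene ring"),
]
prompt = build_student_prompt("OC(=O)c1ccccc1", example)
print(prompt)
```

Listing the alignments explicitly in the context is one simple way to expose the intermediate reasoning that monolithic molecule inputs lack.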
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Molecular Property Classification | MoleculeNet BBBP | ROC AUC | 89.25 | 59 |
| Caption-to-molecule generation | ChEBI-20 | Exact Match | 51 | 19 |
| Mol2Cap | ChEBI-20 | BLEU-2 | 67.6 | 6 |
| Cap2Mol | PubChem | BLEU | 76.32 | 3 |
| molecule property prediction | MoleculeNet BACE | ROC-AUC | 0.8795 | 3 |
| Molecule-to-Caption Translation | PubChem | BLEU-2 | 0.414 | 3 |