Speak-to-Structure: Evaluating LLMs in Open-domain Natural Language-Driven Molecule Generation

About

Recently, Large Language Models (LLMs) have demonstrated great potential in natural language-driven molecule discovery. However, existing datasets and benchmarks for molecule-text alignment are predominantly built on one-to-one mappings: they measure an LLM's ability to retrieve a single, pre-defined answer rather than its creative potential to generate diverse, yet equally valid, molecular candidates. To address this critical gap, we propose Speak-to-Structure (S^2-Bench), the first benchmark to evaluate LLMs in open-domain natural language-driven molecule generation. S^2-Bench is specifically designed for one-to-many relationships, challenging LLMs to exhibit genuine molecular understanding and open-ended generation capabilities. Our benchmark includes three key tasks: molecule editing (MolEdit), molecule optimization (MolOpt), and customized molecule generation (MolCustom), each probing a different aspect of molecule discovery. We also introduce OpenMolIns, a large-scale instruction tuning dataset that enables Llama3.1-8B to surpass the most powerful LLMs, such as GPT-4o and Claude-3.5, on S^2-Bench. Our comprehensive evaluation of 31 LLMs shifts the focus from simple pattern recall to realistic molecular design, paving the way for more capable LLMs in natural language-driven molecule discovery. Our code and datasets are fully accessible through the GitHub repository: https://github.com/phenixace/S2-TOMG-Bench and Hugging Face Datasets: https://huggingface.co/datasets/phenixace/S2-TOMG-Bench.

Jiatong Li, Junxian Li, Weida Wang, Yunqing Liu, Changmeng Zheng, Yatao Bian, Dongzhan Zhou, Xiao-yong Wei, Qing Li • 2024

Related benchmarks

Task	Dataset	Result
Molecular Optimization (QED)	TOMG-Bench	Success Rate (SR): 57.86	39
Molecular Optimization (LogP)	TOMG-Bench	Success Rate (SR): 80.54	39
Molecular Optimization (MR)	TOMG-Bench	Success Rate (SR): 78.76	39
Text-guided molecule generation	S²-Bench	--	10
Single Property Optimization	Single Property Optimization (test)	Average Score: 68	9
Molecule Editing	S2-Bench (test)	SR (AddComp): 77.9	9
Single-property Molecule Optimization	S^2-Bench v1.0 (test)	logP SR: 88.22	9
Molecular Component Editing	Molecular Component Editing	Average Success Rate: 54.5	9

