ES-Merging: Biological MLLM Merging via Embedding Space Signals

About

Biological multimodal large language models (MLLMs) have emerged as powerful foundation models for scientific discovery. However, existing models are each specialized to a single modality, which limits their ability to solve inherently cross-modal scientific problems. Model merging is an efficient way to combine these modalities into a unified MLLM, but existing methods rely on input-agnostic parameter-space heuristics that fail to faithfully capture modality specialization. To overcome this limitation, we propose Embedding-Signal-based MLLM Merging (ES-Merging), a framework that estimates merging coefficients from embedding-space signals, shifting the merging paradigm from parameter-space signals to embedding-space signals. ES-Merging uses coarse-grained signals from the embedding space to estimate layer-wise merging coefficients and fine-grained signals to estimate element-wise merging coefficients; the two sets of coefficients are then combined so that they complement each other. Through extensive experiments, we demonstrate that ES-Merging outperforms existing merging methods on both cross-modal reasoning and single-modal knowledge preservation, establishing that embedding-space signals provide a principled and effective foundation for MLLM merging.

Wonbin Lee, Dongki Kim, Sung Ju Hwang • 2026
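
The abstract describes combining layer-wise and element-wise merging coefficients but does not give the formula. The sketch below shows one way such coefficients could be applied once they are available. It is an illustration under assumptions, not the authors' method: it assumes a shared base model, assumes each expert contributes its difference from that base, and assumes the two coefficient types are combined by multiplication. How the coefficients are estimated from embedding-space signals is not covered here. All names (merge_models, molecule_expert, protein_expert, layer0) are hypothetical.

```python
from typing import Dict, List

Vector = List[float]
Layers = Dict[str, Vector]  # layer name -> flattened weights


def merge_models(
    base: Layers,
    experts: List[Layers],
    layer_coeffs: List[Dict[str, float]],
    element_coeffs: List[Layers],
) -> Layers:
    """Assumed rule: merged = base + sum_k a_k[layer] * b_k[layer][i] * (expert_k - base).

    a_k is one scalar per layer (layer-wise coefficient) and b_k is one value
    per weight (element-wise coefficient). Multiplying them is an assumption.
    """
    merged: Layers = {}
    for name, base_w in base.items():
        out = list(base_w)  # start from a copy of the base weights
        for expert, a, b in zip(experts, layer_coeffs, element_coeffs):
            for i, w in enumerate(expert[name]):
                delta = w - base_w[i]  # this expert's change relative to base
                out[i] += a[name] * b[name][i] * delta
        merged[name] = out
    return merged


# Toy example: two hypothetical single-modality experts with one 2-weight layer.
base = {"layer0": [0.0, 0.0]}
molecule_expert = {"layer0": [1.0, 2.0]}
protein_expert = {"layer0": [4.0, 0.0]}

merged = merge_models(
    base,
    [molecule_expert, protein_expert],
    layer_coeffs=[{"layer0": 0.5}, {"layer0": 0.5}],
    element_coeffs=[{"layer0": [1.0, 1.0]}, {"layer0": [1.0, 0.0]}],
)
print(merged)  # {'layer0': [2.5, 1.0]}
```

In this toy case the layer-wise coefficient scales each expert's whole layer by 0.5, while the element-wise coefficient switches off the protein expert's second weight, so only the molecule expert contributes there.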

Related benchmarks

Task	Dataset	Result
Drug-Target Interaction Prediction	BIOSNAP	Accuracy 0.691	28
CYP Inhibition Prediction	TDC CYP Inhibition	Accuracy (CYP1A2) 77.4	13
Molecule-Cell Interaction	GDSC 2	Accuracy 94.1	13
Molecule-Protein Interaction	BindingDB	Accuracy 66	13
Molecule-Protein Interaction	Human	Accuracy 62	13
CYP Substrate Prediction	TDC CYP Substrate	Accuracy (CYP2C9) 64.2	13
Molecule-Cell Interaction	DrugComb	Accuracy 80.7	13

