MOON2.0: Dynamic Modality-balanced Multimodal Representation Learning for E-commerce Product Understanding

About

Recent Multimodal Large Language Models (MLLMs) have significantly advanced e-commerce product understanding. However, they still face three challenges: (i) the modality imbalance induced by modality-mixed training; (ii) underutilization of the intrinsic alignment relationships between visual and textual information within a product; and (iii) limited handling of noise in e-commerce multimodal data. To address these, we propose MOON2.0, a dynamic modality-balanced MultimOdal representation learning framework for e-commerce prOduct uNderstanding. It comprises: (1) a Modality-driven Mixture-of-Experts (MoE) that adaptively processes input samples by their modality composition, enabling Multimodal Joint Learning to mitigate the modality imbalance; (2) a Dual-level Alignment method to better leverage semantic alignment properties inside individual products; and (3) an MLLM-based Image-text Co-augmentation strategy that integrates textual enrichment with visual expansion, coupled with Dynamic Sample Filtering to improve training data quality. We further release MBE2.0, a co-augmented Multimodal representation Benchmark for E-commerce representation learning and evaluation, at https://huggingface.co/datasets/ZHNie/MBE2.0. Experiments show that MOON2.0 delivers state-of-the-art zero-shot performance on MBE2.0 and multiple public datasets. Furthermore, attention-based heatmap visualization provides qualitative evidence of improved multimodal alignment in MOON2.0.
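To make the idea of processing samples "by their modality composition" concrete, here is a minimal toy sketch in Python. It is not the MOON2.0 implementation: the abstract does not specify how the Modality-driven MoE routes or combines experts, so the hard routing rule, the three expert names, and the helper functions `modality_key` and `route` are all assumptions introduced for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]


def modality_key(image: Optional[Vector], text: Optional[Vector]) -> str:
    """Classify a sample by which modalities it contains (hypothetical rule)."""
    if image is not None and text is not None:
        return "multimodal"
    if image is not None:
        return "image"
    if text is not None:
        return "text"
    raise ValueError("sample has neither image nor text features")


# Hypothetical stand-in experts: real experts would be learned networks.
EXPERTS: Dict[str, Callable[[Optional[Vector], Optional[Vector]], Vector]] = {
    "image": lambda image, text: list(image),
    "text": lambda image, text: list(text),
    # Assumed fusion for illustration only: element-wise mean of both inputs.
    "multimodal": lambda image, text: [(a + b) / 2 for a, b in zip(image, text)],
}


def route(image: Optional[Vector], text: Optional[Vector]) -> Tuple[str, Vector]:
    """Send the sample to the expert matching its modality composition."""
    key = modality_key(image, text)
    return key, EXPERTS[key](image, text)
```

Under these assumptions, an image-only query, a text-only query, and a full product listing each pass through a different expert while sharing one entry point, which is the general shape of a design in which no single modality dominates a shared set of parameters.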

Zhanheng Nie, Chenghan Fu, Daoze Zhang, Junxian Wu, Wanxian Guan, Pengjie Wang, Jian Xu, Bo Zheng • 2025

Related benchmarks

Task	Dataset	Result
Image Retrieval	Fashion200k (test)	Recall@1 13.1	58
Multimodal Retrieval (text query to multimodal candidate)	MBE 2.0	R@1 43.34	50
Multimodal Retrieval	M5Product	Recall@1 15.27	30
Multimodal Retrieval (text query to multimodal content)	M5Product (test)	Recall@1 15.27	26
Classification	M5Product	Accuracy 95.5	24
Product Classification	Fashion200k	Accuracy 66.44	23
Multimodal Retrieval (image query to multimodal content)	M5Product (test)	Recall@1 11.28	23
Text-to-Image Retrieval	Fashion200k	Recall@5 25.25	19
Image-to-Text Retrieval	Fashion200k	R@5 23.16	19
Attribute Prediction	MBE 3.0 1.0 (test)	Accuracy 36.36	13

Showing 10 of 19 rows
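
For readers unfamiliar with the Recall@K (R@K) metric reported in the table, the following is a minimal sketch of how it is commonly computed. It assumes exactly one relevant candidate per query and reports a percentage; the exact evaluation protocol of each benchmark is not stated on this page, and the function name `recall_at_k` is introduced here for illustration.

```python
from typing import Sequence


def recall_at_k(ranked: Sequence[Sequence[str]], relevant: Sequence[str], k: int) -> float:
    """Percentage of queries whose relevant item appears in the top-k results.

    ranked[i] is the candidate list for query i, best match first.
    relevant[i] is the single ground-truth item for query i (an assumption).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if len(ranked) != len(relevant):
        raise ValueError("ranked and relevant must have the same length")
    if not ranked:
        return 0.0
    hits = sum(1 for cands, gold in zip(ranked, relevant) if gold in cands[:k])
    return 100.0 * hits / len(ranked)


# Toy example: four queries that all share the ground-truth item "a".
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"], ["b", "c", "a"]]
gold = ["a", "a", "a", "a"]
print(recall_at_k(rankings, gold, 1))  # 25.0
print(recall_at_k(rankings, gold, 2))  # 50.0
```

Read this way, a Recall@1 of 13.1 would mean that the top-ranked candidate is the correct one for about 13 in every 100 queries.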
