MOON2.0: Dynamic Modality-balanced Multimodal Representation Learning for E-commerce Product Understanding

About

Recent Multimodal Large Language Models (MLLMs) have significantly advanced e-commerce product understanding. However, they still face three challenges: (i) the modality imbalance induced by modality mixed training; (ii) underutilization of the intrinsic alignment relationships among visual and textual information within a product; and (iii) limited handling of noise in e-commerce multimodal data. To address these, we propose MOON2.0, a dynamic modality-balanced MultimOdal representation learning framework for e-commerce prOduct uNderstanding. It comprises: (1) a Modality-driven Mixture-of-Experts (MoE) that adaptively processes input samples by their modality composition, enabling Multimodal Joint Learning to mitigate the modality imbalance; (2) a Dual-level Alignment method to better leverage semantic alignment properties inside individual products; and (3) an MLLM-based Image-text Co-augmentation strategy that integrates textual enrichment with visual expansion, coupled with Dynamic Sample Filtering to improve training data quality. We further release MBE2.0, a co-augmented Multimodal representation Benchmark for E-commerce representation learning and evaluation at https://huggingface.co/datasets/ZHNie/MBE2.0. Experiments show that MOON2.0 delivers state-of-the-art zero-shot performance on MBE2.0 and multiple public datasets. Furthermore, attention-based heatmap visualization provides qualitative evidence of improved multimodal alignment of MOON2.0.

Zhanheng Nie, Chenghan Fu, Daoze Zhang, Junxian Wu, Wanxian Guan, Pengjie Wang, Jian Xu, Bo Zheng • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Image Retrieval | Fashion200k (test) | Recall@1: 13.1 | 58 |
| Multimodal Retrieval (text query to multimodal candidate) | MBE 2.0 | R@1: 43.34 | 50 |
| Multimodal Retrieval | M5Product | Recall@1: 15.27 | 30 |
| Multimodal Retrieval (text query to multimodal content) | M5Product (test) | Recall@1: 15.27 | 26 |
| Classification | M5Product | Accuracy: 95.5 | 24 |
| Product Classification | Fashion200k | Accuracy: 66.44 | 23 |
| Text-to-Image Retrieval | Fashion200k | Recall@10: 31.39 | 18 |
| Image-to-Text Retrieval | Fashion200k | R@10: 27.09 | 18 |
| Attribute Prediction | MBE 3.0 1.0 (test) | Accuracy: 36.36 | 13 |
| Multimodal Retrieval (image query to multimodal content) | M5Product (test) | Recall@1: 11.28 | 13 |
Showing 10 of 19 rows
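
Most retrieval rows above report Recall@K (also written R@K). This page does not say which variant each benchmark uses, so the sketch below assumes one common definition: the percentage of queries for which at least one relevant item appears among the top K retrieved candidates. The function name `recall_at_k` and the toy product IDs are hypothetical and are not taken from MOON2.0 or any of the listed datasets.

```python
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[Sequence[str]],
                relevant_ids: Sequence[Set[str]],
                k: int) -> float:
    """Percentage of queries with at least one relevant item in the top k.

    Assumption: hit-rate style Recall@K, reported on a 0-100 scale.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if len(ranked_ids) != len(relevant_ids):
        raise ValueError("need one ranking and one relevant set per query")
    if not ranked_ids:
        return 0.0
    hits = sum(
        1
        for ranking, relevant in zip(ranked_ids, relevant_ids)
        if any(item in relevant for item in ranking[:k])
    )
    return 100.0 * hits / len(ranked_ids)


# Toy data: one ranked candidate list and one relevant set per query.
rankings = [
    ["p3", "p1", "p7"],  # query 0: relevant item p1 sits at rank 2
    ["p2", "p9", "p4"],  # query 1: relevant item p2 sits at rank 1
    ["p5", "p6", "p8"],  # query 2: relevant item p0 is never retrieved
    ["p4", "p0", "p1"],  # query 3: relevant item p1 sits at rank 3
]
relevant = [{"p1"}, {"p2"}, {"p0"}, {"p1"}]

print(recall_at_k(rankings, relevant, k=1))  # 25.0
print(recall_at_k(rankings, relevant, k=3))  # 75.0
```

Under this definition Recall@K can only stay the same or grow as K increases, so Recall@1 and Recall@10 figures in the table are not directly comparable with each other.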
