Captions Speak Louder than Images: Generalizing Foundation Models for E-commerce from High-quality Multimodal Instruction Data

About

Leveraging multimodal data to drive breakthroughs in e-commerce applications through Multimodal Foundation Models (MFMs) is gaining increasing attention from the research community. However, there are significant challenges that hinder the optimal use of multimodal e-commerce data by foundation models: (1) the scarcity of large-scale, high-quality multimodal benchmark datasets; and (2) the lack of effective multimodal information integration methods. To address these challenges, in this paper, we introduce MMECInstruct, the first-ever, large-scale, and high-quality multimodal instruction dataset for e-commerce. We also develop CASLIE, a simple, lightweight, yet effective framework for integrating multimodal information for e-commerce. Leveraging MMECInstruct, we fine-tune a series of e-commerce MFMs within CASLIE, denoted as CASLIE models. Our comprehensive evaluation demonstrates that CASLIE models substantially outperform 5 categories of advanced baseline models in the in-domain evaluation. Moreover, CASLIE models show strong generalizability to out-of-domain settings. MMECInstruct and CASLIE models are publicly accessible through https://ninglab.github.io/CASLIE/.

Xinyi Ling, Hanwen Du, Bo Peng, Zhihui Zhu, Xia Ning • 2024

Related benchmarks

Task	Dataset	Result
Image Retrieval	Fashion200k (test)	Recall@1 4.71	58
Multimodal Retrieval (text query to multimodal candidate)	MBE 2.0	R@1 26.32	50
Multimodal Retrieval	M5Product	Recall@1 8.4	30
Multimodal Retrieval (text query to multimodal content)	M5Product (test)	Recall@1 8.4	26
Classification	M5Product	Accuracy 38.16	24
Product Classification	Fashion200k	Accuracy 54.88	23
Multimodal Retrieval (image query to multimodal content)	M5Product (test)	Recall@1 8	23
Image-to-Text Retrieval	Fashion200k	R@5 10.04	19
Text-to-Image Retrieval	Fashion200k	Recall@5 11.25	19
Multimodal Retrieval (q^i -> e^mm)	MBE 3.0 1.0 (test)	Recall@1 9.02	13

Showing 10 of 19 rows
