Captions Speak Louder than Images: Generalizing Foundation Models for E-commerce from High-quality Multimodal Instruction Data
About
Leveraging multimodal data to drive breakthroughs in e-commerce applications through Multimodal Foundation Models (MFMs) is gaining increasing attention from the research community. However, there are significant challenges that hinder the optimal use of multimodal e-commerce data by foundation models: (1) the scarcity of large-scale, high-quality multimodal benchmark datasets; and (2) the lack of effective multimodal information integration methods. To address these challenges, in this paper, we introduce MMECInstruct, the first-ever, large-scale, and high-quality multimodal instruction dataset for e-commerce. We also develop CASLIE, a simple, lightweight, yet effective framework for integrating multimodal information for e-commerce. Leveraging MMECInstruct, we fine-tune a series of e-commerce MFMs within CASLIE, denoted as CASLIE models. Our comprehensive evaluation demonstrates that CASLIE models substantially outperform 5 categories of advanced baseline models in the in-domain evaluation. Moreover, CASLIE models show strong generalizability to out-of-domain settings. MMECInstruct and CASLIE models are publicly accessible through https://ninglab.github.io/CASLIE/.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Image Retrieval | Fashion200k (test) | Recall@1 | 4.71 | 58 |
| Multimodal Retrieval (text query to multimodal candidate) | MBE 2.0 | R@1 | 26.32 | 50 |
| Multimodal Retrieval | M5Product | Recall@1 | 8.4 | 30 |
| Multimodal Retrieval (text query to multimodal content) | M5Product (test) | Recall@1 | 8.4 | 26 |
| Classification | M5Product | Accuracy | 38.16 | 24 |
| Product Classification | Fashion200k | Accuracy | 54.88 | 23 |
| Image-to-Text Retrieval | Fashion200k | R@10 | 13.89 | 18 |
| Text-to-Image Retrieval | Fashion200k | Recall@10 | 14.12 | 18 |
| Multimodal Retrieval (image query to multimodal content) | M5Product (test) | Recall@1 | 8 | 13 |
| Multimodal Retrieval (q^i -> e^mm) | MBE 3.0 1.0 (test) | Recall@1 | 9.02 | 13 |