Adapting Vision-Language Models for E-commerce Understanding at Scale
About
E-commerce product understanding inherently demands strong multimodal comprehension across text, images, and structured attributes. General-purpose Vision-Language Models (VLMs) provide generalizable multimodal representations, yet there is no well-documented strategy for adapting them to the attribute-centric, multi-image, and noisy nature of e-commerce data without sacrificing general performance. In this work, we show through a large-scale experimental study how targeted adaptation of general VLMs can substantially improve e-commerce performance while preserving broad multimodal capabilities. Furthermore, we propose a novel, extensive evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction, illustrated by the sketch below.
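The abstract does not specify how products are presented to the model, so the following is only a minimal, hypothetical sketch of what a dynamic attribute-extraction request could look like: a multi-image product listing plus a per-request attribute schema, packed into a generic chat-style VLM message. The function name, example URLs, schema keys, and the OpenAI-style multimodal message format are all illustrative assumptions, not the paper's method.

```python
# Illustrative sketch (assumptions, not the paper's pipeline): format a
# multi-image e-commerce listing into a chat-style prompt for dynamic
# attribute extraction. The attribute schema is supplied per request, so the
# same model can be asked for different attribute sets across categories.
import json


def build_attribute_extraction_prompt(image_urls, title, description, attribute_schema):
    """Return a chat-style message list for a generic multimodal chat API.

    `attribute_schema` maps attribute names to short hints, e.g.
    {"color": "dominant product color", "material": "primary material"}.
    """
    schema_text = "\n".join(f"- {name}: {hint}" for name, hint in attribute_schema.items())
    user_content = (
        # One image entry per product photo (front, side, detail shots, ...).
        [{"type": "image_url", "image_url": {"url": url}} for url in image_urls]
        + [{
            "type": "text",
            "text": (
                f"Product title: {title}\n"
                f"Description: {description}\n\n"
                "Extract the following attributes from the images and text. "
                "Answer with a JSON object using exactly these keys; use null "
                "when an attribute cannot be determined.\n"
                f"{schema_text}"
            ),
        }]
    )
    return [{"role": "user", "content": user_content}]


if __name__ == "__main__":
    # Hypothetical listing with two product images and a two-attribute schema.
    messages = build_attribute_extraction_prompt(
        image_urls=[
            "https://example.com/shoe_front.jpg",
            "https://example.com/shoe_side.jpg",
        ],
        title="Trail running shoe",
        description="Lightweight mesh upper with rubber outsole.",
        attribute_schema={"color": "dominant product color", "material": "upper material"},
    )
    print(json.dumps(messages, indent=2))
```

A schema-in-the-prompt design like this would let the attribute set vary at inference time without retraining, which is one plausible reading of "dynamic attribute extraction" as named in the abstract.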
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Multimodal Understanding | MMStar | -- | 197 |
| Multimodal Understanding | MME | -- | 158 |
| Text-based Visual Question Answering | TextVQA (val) | -- | 146 |
| Multimodal Understanding | MMBench (dev) | -- | 58 |
| e-Commerce | eComMMMU (test) | eComMMMU Score: 58.3 | 13 |
| Vision | CVBench | CVBench Score: 77.2 | 13 |
| OCR, Chat/Doc QA | AI2D (val) | AI2D Accuracy: 82.6 | 13 |
| Reasoning | MMMU (val) | MMMU Score: 50.4 | 13 |