
Adapting Vision-Language Models for E-commerce Understanding at Scale

About

E-commerce product understanding inherently demands strong multimodal comprehension of text, images, and structured attributes. General-purpose Vision-Language Models (VLMs) enable generalizable multimodal latent modelling, yet there is no documented, well-known strategy for adapting them to the attribute-centric, multi-image, and noisy nature of e-commerce data without sacrificing general performance. In this work, we show through a large-scale experimental study how targeted adaptation of general VLMs can substantially improve e-commerce performance while preserving broad multimodal capabilities. Furthermore, we propose a novel, extensive evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction.
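
The paper does not detail its attribute extraction metric here, but as a rough illustration of how attribute-level scoring can work in such a suite, the sketch below computes exact-match accuracy over predicted versus gold product attributes. The function name, attribute names, and matching rule are illustrative assumptions, not the paper's actual evaluation protocol.

```python
# Hypothetical illustration of attribute-level exact-match scoring for a
# product attribute extraction task. Not the paper's metric; attribute
# names and the matching rule are assumptions for illustration only.

def attribute_exact_match(predicted: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of gold attributes whose predicted value matches exactly
    (case- and whitespace-insensitive)."""
    if not gold:
        return 1.0
    hits = sum(
        1
        for name, gold_value in gold.items()
        if predicted.get(name, "").strip().lower() == gold_value.strip().lower()
    )
    return hits / len(gold)


if __name__ == "__main__":
    gold = {"color": "Midnight Blue", "material": "Leather", "brand": "Acme"}
    predicted = {"color": "midnight blue", "material": "Suede", "brand": "Acme"}
    # Two of three attributes match, so the score is 0.67.
    print(f"attribute exact match: {attribute_exact_match(predicted, gold):.2f}")
```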

Matteo Nulli, Vladimir Orshulevich, Tala Bazazo, Christian Herold, Michael Kozielski, Marcin Mazur, Szymon Tuzel, Cees G. M. Snoek, Seyyed Hadi Hashemi, Omar Javed, Yannick Versley, Shahram Khadivi • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Multimodal Understanding | MMStar | -- | 197 |
| Multimodal Understanding | MME | -- | 158 |
| Text-based Visual Question Answering | TextVQA (val) | -- | 146 |
| Multimodal Understanding | MMBench (dev) | -- | 58 |
| e-Commerce | eComMMMU (test) | eComMMMU Score: 58.3 | 13 |
| Vision | CVBench | CVBench Score: 77.2 | 13 |
| OCR, Chat/Doc QA | AI2D (val) | AI2D Accuracy: 82.6 | 13 |
| Reasoning | MMMU (val) | MMMU Score: 50.4 | 13 |
