
Adapting Vision-Language Models for E-commerce Understanding at Scale

About

E-commerce product understanding inherently demands strong multimodal comprehension of text, images, and structured attributes. General-purpose Vision-Language Models (VLMs) provide generalizable multimodal representations, yet there is no well-documented strategy for adapting them to the attribute-centric, multi-image, and noisy nature of e-commerce data without sacrificing general performance. In this work, we show through a large-scale experimental study how targeted adaptation of general VLMs can substantially improve e-commerce performance while preserving broad multimodal capabilities. Furthermore, we propose a novel, extensive evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction.

Matteo Nulli, Vladimir Orshulevich, Tala Bazazo, Christian Herold, Michael Kozielski, Marcin Mazur, Szymon Tuzel, Cees G. M. Snoek, Seyyed Hadi Hashemi, Omar Javed, Yannick Versley, Shahram Khadivi • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Multimodal Understanding | MMStar | -- | 324 |
| Text-based Visual Question Answering | TextVQA (val) | -- | 262 |
| Multimodal Understanding | MME | -- | 207 |
| Multimodal Understanding | MMBench (dev) | -- | 58 |
| e-Commerce | eComMMMU (test) | eComMMMU Score: 58.3 | 13 |
| Vision | CVBench | CVBench Score: 77.2 | 13 |
| OCR, Chat/Doc QA | AI2D (val) | AI2D Accuracy: 82.6 | 13 |
| Reasoning | MMMU (val) | MMMU Score: 50.4 | 13 |
