Adapting Vision-Language Models for E-commerce Understanding at Scale
About
E-commerce product understanding inherently demands strong multimodal comprehension across text, images, and structured attributes. General-purpose Vision-Language Models (VLMs) provide generalizable multimodal representations, yet there is no well-documented strategy for adapting them to the attribute-centric, multi-image, and noisy nature of e-commerce data without sacrificing general performance. In this work, we show through a large-scale experimental study how targeted adaptation of general VLMs can substantially improve e-commerce performance while preserving broad multimodal capabilities. Furthermore, we propose a novel, extensive evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction, illustrated by the sketch below.
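The abstract does not specify how products are presented to the model, so the following is only a minimal, hypothetical sketch of what a dynamic attribute-extraction request could look like: a multi-image product listing plus a per-request attribute schema, packed into a generic chat-style VLM message. The function name, example URLs, schema keys, and the OpenAI-style multimodal message format are all illustrative assumptions, not the paper's method.

```python
# Illustrative sketch (assumptions, not the paper's pipeline): format a
# multi-image e-commerce listing into a chat-style prompt for dynamic
# attribute extraction. The attribute schema is supplied per request, so the
# same model can be asked for different attribute sets across categories.
import json


def build_attribute_extraction_prompt(image_urls, title, description, attribute_schema):
    """Return a chat-style message list for a generic multimodal chat API.

    `attribute_schema` maps attribute names to short hints, e.g.
    {"color": "dominant product color", "material": "primary material"}.
    """
    schema_text = "\n".join(f"- {name}: {hint}" for name, hint in attribute_schema.items())
    user_content = (
        # One image entry per product photo (front, side, detail shots, ...).
        [{"type": "image_url", "image_url": {"url": url}} for url in image_urls]
        + [{
            "type": "text",
            "text": (
                f"Product title: {title}\n"
                f"Description: {description}\n\n"
                "Extract the following attributes from the images and text. "
                "Answer with a JSON object using exactly these keys; use null "
                "when an attribute cannot be determined.\n"
                f"{schema_text}"
            ),
        }]
    )
    return [{"role": "user", "content": user_content}]


if __name__ == "__main__":
    # Hypothetical listing with two product images and a two-attribute schema.
    messages = build_attribute_extraction_prompt(
        image_urls=[
            "https://example.com/shoe_front.jpg",
            "https://example.com/shoe_side.jpg",
        ],
        title="Trail running shoe",
        description="Lightweight mesh upper with rubber outsole.",
        attribute_schema={"color": "dominant product color", "material": "upper material"},
    )
    print(json.dumps(messages, indent=2))
```

A schema-in-the-prompt design like this would let the attribute set vary at inference time without retraining, which is one plausible reading of "dynamic attribute extraction" as named in the abstract.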
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Multimodal Understanding | MMStar | -- | 197 |
| Multimodal Understanding | MME | -- | 158 |
| Text-based Visual Question Answering | TextVQA (val) | -- | 146 |
| Multimodal Understanding | MMBench (dev) | -- | 58 |
| e-Commerce | eComMMMU (test) | eComMMMU Score: 58.3 | 13 |
| Vision | CVBench | CVBench Score: 77.2 | 13 |
| OCR, Chat/Doc QA | AI2D (val) | AI2D Accuracy: 82.6 | 13 |
| Reasoning | MMMU (val) | MMMU Score: 50.4 | 13 |