
Ostrakon-VL: Towards Domain-Expert MLLM for Food-Service and Retail Stores

About

Multimodal Large Language Models (MLLMs) have recently achieved substantial progress in general-purpose perception and reasoning. Nevertheless, their deployment in Food-Service and Retail Stores (FSRS) scenarios encounters two major obstacles: (i) real-world FSRS data, collected from heterogeneous acquisition devices, are highly noisy and lack auditable, closed-loop data curation, which impedes the construction of high-quality, controllable, and reproducible training corpora; and (ii) existing evaluation protocols do not offer a unified, fine-grained, and standardized benchmark spanning single-image, multi-image, and video inputs, making it difficult to objectively gauge model robustness. To address these challenges, we first develop Ostrakon-VL, an FSRS-oriented MLLM based on Qwen3-VL-8B. Second, we introduce ShopBench, the first public benchmark for FSRS. Third, we propose QUAD (Quality-aware Unbiased Automated Data-curation), a multi-stage multimodal instruction data curation pipeline. Leveraging a multi-stage training strategy, Ostrakon-VL achieves an average score of 60.1 on ShopBench, establishing a new state of the art among open-source MLLMs with comparable parameter scales and diverse architectures. Notably, it surpasses the substantially larger Qwen3-VL-235B-A22B (59.4) by +0.7 and exceeds the same-scale Qwen3-VL-8B (55.3) by +4.8, demonstrating significantly improved parameter efficiency. These results indicate that Ostrakon-VL delivers more robust and reliable FSRS-centric perception and decision-making capabilities. To facilitate reproducible research, we will publicly release Ostrakon-VL and the ShopBench benchmark.
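As a quick sanity check, the score margins quoted above can be reproduced with a short snippet (the three ShopBench averages are taken directly from the abstract; the dictionary layout is just an illustration):

```python
# ShopBench average scores reported in the abstract.
scores = {
    "Ostrakon-VL (8B)": 60.1,
    "Qwen3-VL-235B-A22B": 59.4,
    "Qwen3-VL-8B": 55.3,
}

ours = scores["Ostrakon-VL (8B)"]
for name, score in scores.items():
    if name != "Ostrakon-VL (8B)":
        # Margin of Ostrakon-VL over each baseline, rounded to one decimal.
        print(f"{name}: {ours - score:+.1f}")
```

Running this prints the +0.7 and +4.8 margins stated in the abstract.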

Zhiyong Shen, Gongpeng Zhao, Jun Zhou, Li Yu, Guandong Kou, Jichen Li, Chuanlei Dong, Zuncheng Li, Kaimao Li, Bingkun Wei, Shicheng Hu, Wei Xia, Wenguo Duan • 2026

Related benchmarks

| Task                                         | Dataset          | Result               | Rank |
|----------------------------------------------|------------------|----------------------|------|
| OCR Evaluation                               | OCRBench         | Score 61.7           | 296  |
| Multimodal Reasoning                         | MMStar           | --                   | 81   |
| Multi-modal Understanding                    | MMVet            | Accuracy 36.4        | 35   |
| Math & Knowledge                             | MathVista mini   | Accuracy 75.4        | 25   |
| Multimodal Understanding                     | ShopBench        | ShopFront Score 65   | 18   |
| Multimodal Knowledge and Math                | MMMU (val)       | Accuracy 54.8        | 14   |
| Chinese Multi-modal Multi-task Understanding | CMMMU            | Accuracy 33.2        | 13   |
| OCR-related understanding                    | DocVQA           | Score 85.7           | 10   |
| Chinese-language ability                     | Chinese-OCRBench | Score 0.885          | 6    |
| Comprehensive multimodal understanding       | MMBench en (dev) | Overall Score 0.82   | 6    |

(Showing 10 of 14 rows.)
