
Ostrakon-VL: Towards Domain-Expert MLLM for Food-Service and Retail Stores

About

Multimodal Large Language Models (MLLMs) have recently achieved substantial progress in general-purpose perception and reasoning. Nevertheless, their deployment in Food-Service and Retail Stores (FSRS) scenarios encounters two major obstacles: (i) real-world FSRS data, collected from heterogeneous acquisition devices, are highly noisy and lack auditable, closed-loop data curation, which impedes the construction of high-quality, controllable, and reproducible training corpora; and (ii) existing evaluation protocols do not offer a unified, fine-grained, and standardized benchmark spanning single-image, multi-image, and video inputs, making it difficult to objectively gauge model robustness. To address these challenges, we first develop Ostrakon-VL, an FSRS-oriented MLLM based on Qwen3-VL-8B. Second, we introduce ShopBench, the first public benchmark for FSRS. Third, we propose QUAD (Quality-aware Unbiased Automated Data-curation), a multi-stage multimodal instruction data curation pipeline. Leveraging a multi-stage training strategy, Ostrakon-VL achieves an average score of 60.1 on ShopBench, establishing a new state of the art among open-source MLLMs with comparable parameter scales and diverse architectures. Notably, it surpasses the substantially larger Qwen3-VL-235B-A22B (59.4) by +0.7 and exceeds the same-scale Qwen3-VL-8B (55.3) by +4.8, demonstrating significantly improved parameter efficiency. These results indicate that Ostrakon-VL delivers more robust and reliable FSRS-centric perception and decision-making capabilities. To facilitate reproducible research, we will publicly release Ostrakon-VL and the ShopBench benchmark.
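As a quick sanity check, the score margins quoted above can be reproduced with a short snippet (the three ShopBench averages are taken directly from the abstract; the dictionary layout is just an illustration):

```python
# ShopBench average scores reported in the abstract.
scores = {
    "Ostrakon-VL (8B)": 60.1,
    "Qwen3-VL-235B-A22B": 59.4,
    "Qwen3-VL-8B": 55.3,
}

ours = scores["Ostrakon-VL (8B)"]
for name, score in scores.items():
    if name != "Ostrakon-VL (8B)":
        # Margin of Ostrakon-VL over each baseline, rounded to one decimal.
        print(f"{name}: {ours - score:+.1f}")
```

Running this prints the +0.7 and +4.8 margins stated in the abstract.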

Zhiyong Shen, Gongpeng Zhao, Jun Zhou, Li Yu, Guandong Kou, Jichen Li, Chuanlei Dong, Zuncheng Li, Kaimao Li, Bingkun Wei, Shicheng Hu, Wei Xia, Wenguo Duan • 2026

Related benchmarks

| Task                                         | Dataset          | Result               | Rank |
|----------------------------------------------|------------------|----------------------|------|
| OCR Evaluation                               | OCRBench         | Score 61.7           | 296  |
| Multimodal Reasoning                         | MMStar           | --                   | 81   |
| Multi-modal Understanding                    | MMVet            | Accuracy 36.4        | 35   |
| Math & Knowledge                             | MathVista mini   | Accuracy 75.4        | 25   |
| Multimodal Understanding                     | ShopBench        | ShopFront Score 65   | 18   |
| Multimodal Knowledge and Math                | MMMU (val)       | Accuracy 54.8        | 14   |
| Chinese Multi-modal Multi-task Understanding | CMMMU            | Accuracy 33.2        | 13   |
| OCR-related understanding                    | DocVQA           | Score 85.7           | 10   |
| Chinese-language ability                     | Chinese-OCRBench | Score 0.885          | 6    |
| Comprehensive multimodal understanding       | MMBench en (dev) | Overall Score 0.82   | 6    |

(Showing 10 of 14 rows.)
