Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs

About

Fully open multimodal large language models (MLLMs) currently lag behind proprietary counterparts, primarily due to a significant gap in data quality for supervised fine-tuning (SFT). Existing open-source datasets are often plagued by widespread noise and a critical deficit in complex reasoning data, such as Chain-of-Thought (CoT), which hinders the development of advanced model capabilities. Addressing these challenges, our work makes three primary contributions. First, we introduce Honey-Data-15M, a new SFT dataset comprising approximately 15 million QA pairs, processed through multiple cleaning techniques and enhanced with a novel dual-level (short and long) CoT enrichment strategy. Second, we introduce HoneyPipe, the data curation pipeline, and its underlying framework DataStudio, providing the community with a transparent and adaptable methodology for data curation that moves beyond static dataset releases. Finally, to validate our dataset and pipeline, we train Bee-8B, an 8B model on Honey-Data-15M. Experiments show that Bee-8B establishes a new state-of-the-art (SOTA) for fully open MLLMs, achieving performance that is competitive with, and in some cases surpasses, recent semi-open models such as InternVL3.5-8B. Our work delivers to the community a suite of foundational resources, including: the Honey-Data-15M corpus; the full-stack suite comprising HoneyPipe and DataStudio; training recipes; an evaluation harness; and the model weights. This effort demonstrates that a principled focus on data quality is a key pathway to developing fully open MLLMs that are highly competitive with their semi-open counterparts.

Yi Zhang, Bolin Ni, Xin-Sheng Chen, Heng-Rui Zhang, Yongming Rao, Houwen Peng, Qinglin Lu, Han Hu, Meng-Hao Guo, Shi-Min Hu • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Mathematical Reasoning | WeMath | Accuracy 59.8 | 161 |
| Mathematical Reasoning | MathVision | Accuracy 50 | 144 |
| General VQA | MMVet | Score 83.9 | 40 |
| General Visual Question Answering | MMStar | Score 71.4 | 35 |
| Mathematical Reasoning | MathVerse Vision Only | Accuracy 67 | 34 |
| General Visual Question Answering | RealworldQA | Score 73.1 | 20 |
| OCR and Chart Understanding | OCRBench | Total Score 83.1 | 20 |
| General VQA | MMMU (val) | Score 66.8 | 15 |
| General VQA | POPE | Accuracy 84.8 | 14 |
| OCR VQA | InfoVQA (val) | Accuracy 72.9 | 12 |
Showing 10 of 29 rows
