Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs

About

Fully open multimodal large language models (MLLMs) currently lag behind proprietary counterparts, primarily due to a significant gap in data quality for supervised fine-tuning (SFT). Existing open-source datasets are often plagued by widespread noise and a critical deficit in complex reasoning data, such as Chain-of-Thought (CoT), which hinders the development of advanced model capabilities. Addressing these challenges, our work makes three primary contributions. First, we introduce Honey-Data-15M, a new SFT dataset comprising approximately 15 million QA pairs, processed through multiple cleaning techniques and enhanced with a novel dual-level (short and long) CoT enrichment strategy. Second, we introduce HoneyPipe, the data curation pipeline, and its underlying framework DataStudio, providing the community with a transparent and adaptable methodology for data curation that moves beyond static dataset releases. Finally, to validate our dataset and pipeline, we train Bee-8B, an 8B model on Honey-Data-15M. Experiments show that Bee-8B establishes a new state-of-the-art (SOTA) for fully open MLLMs, achieving performance that is competitive with, and in some cases surpasses, recent semi-open models such as InternVL3.5-8B. Our work delivers to the community a suite of foundational resources, including: the Honey-Data-15M corpus; the full-stack suite comprising HoneyPipe and DataStudio; training recipes; an evaluation harness; and the model weights. This effort demonstrates that a principled focus on data quality is a key pathway to developing fully open MLLMs that are highly competitive with their semi-open counterparts.

Yi Zhang, Bolin Ni, Xin-Sheng Chen, Heng-Rui Zhang, Yongming Rao, Houwen Peng, Qinglin Lu, Han Hu, Meng-Hao Guo, Shi-Min Hu • 2025

Related benchmarks

Task	Dataset	Metric	Result
Mathematical Reasoning	WeMath	Accuracy	59.8	225
Mathematical Reasoning	MathVision	Accuracy	50	168
Multimodal Understanding	SEEDBench2 Plus	Accuracy	68.55	138
General VQA	MMVet	Score	83.9	63
Mathematical Reasoning	MathVerse Vision Only	Accuracy	67	52
General Visual Question Answering	MMStar	Score	71.4	35
Math and Reasoning	MathVista mini	Overall Score	81.4	26
General Visual Question Answering	RealworldQA	Score	73.1	20
OCR and Chart Understanding	OCRBench	Total Score	83.1	20
OCR VQA	InfoVQA (val)	Accuracy	72.9	16

Showing 10 of 38 rows
