Text-Only Data Synthesis for Vision Language Model Training

About

Training vision-language models (VLMs) typically requires large-scale, high-quality image-text pairs, but collecting or synthesizing such data is costly. In contrast, text data is abundant and inexpensive, prompting the question: can high-quality multimodal training data be synthesized purely from text? To tackle this, we propose a cross-integrated three-stage multimodal data synthesis framework, which generates two datasets: Unicorn-1.2M and Unicorn-471K-Instruction. In Stage 1: Diverse Caption Data Synthesis, we construct 1.2M semantically diverse, high-quality captions by expanding sparse caption seeds using large language models (LLMs). In Stage 2: Instruction-Tuning Data Generation, we further process 471K captions into multi-turn instruction-tuning tasks to support complex reasoning. Finally, in Stage 3: Modality Representation Transfer, the textual caption representations are transformed into visual representations, resulting in diverse synthetic image representations. This three-stage process enables us to construct Unicorn-1.2M for pretraining and Unicorn-471K-Instruction for instruction tuning, without relying on real images. By eliminating the dependency on real images while maintaining data quality and diversity, our framework offers a cost-effective and scalable solution for VLM training.
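To make Stage 3 more concrete, here is a minimal sketch of one way caption embeddings could be shifted toward a visual representation space. The abstract does not specify the transfer procedure, so the mean-shift approach below is an assumption chosen for illustration and should not be read as the authors' method. The function names (`mean_vector`, `l2_normalize`, `transfer_to_visual`), the optional `image_mean` parameter, and the toy embeddings are all hypothetical.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def transfer_to_visual(text_embeddings, image_mean=None):
    """Hypothetical mean-shift transfer (an assumption, not the paper's stated method).

    Subtracts the mean of the text embeddings, optionally adds an
    image-side mean vector, and re-normalizes each result to unit length.
    """
    text_mean = mean_vector(text_embeddings)
    shifted = []
    for embedding in text_embeddings:
        vector = [x - m for x, m in zip(embedding, text_mean)]
        if image_mean is not None:
            vector = [x + m for x, m in zip(vector, image_mean)]
        shifted.append(l2_normalize(vector))
    return shifted


# Toy 3-dimensional "caption embeddings" (made-up numbers, not real encoder output).
captions = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.0], [0.8, 0.0, 0.2]]
synthetic_visual = transfer_to_visual(captions)
```

In a real pipeline the embeddings would come from a text encoder that shares a space with an image encoder, and `image_mean` would have to be estimated from some image embeddings or omitted, as it is here.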

Xiaomin Yu, Wenjie Zhang, Ziyue Qiao, Chengwei Qin, Hui Xiong • 2025

Related benchmarks

Task	Dataset	Result
Hallucination Evaluation	POPE	--	281
Multimodal Reasoning	LogicVista	Accuracy 29.53	172
General image understanding	MMStar	Accuracy 35.13	67
General Visual Understanding	RealworldQA	Accuracy 42.35	64
Hallucination Evaluation	HallBench	Accuracy 43.01	59
Multimodal Reasoning	MMMU	MMMU Score 36.87	27
Multimodal Reasoning	VisuLogic	Pass@1 26.8	21
Hallucination Evaluation	CRPE	Score 42.32	14
General Visual Understanding	MME	MME Score 60.24	4
General Visual Understanding	SQA	SQA Score 68.81	4

Showing 10 of 13 rows
