Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation

About

Recently, GPT-4o has garnered significant attention for its strong performance in image generation, yet open-source models still lag behind. Several studies have explored distilling image data from GPT-4o to enhance open-source models, achieving notable progress. However, a key question remains: given that real-world image datasets already constitute a natural source of high-quality data, why should we use GPT-4o-generated synthetic data? In this work, we identify two key advantages of synthetic images. First, they can complement rare scenarios in real-world datasets, such as surreal fantasy or multi-reference image generation, which frequently occur in user queries. Second, they provide clean and controllable supervision. Real-world data often contains complex background noise and inherent misalignment between text descriptions and image content, whereas synthetic images offer pure backgrounds and long-tailed supervision signals, facilitating more accurate text-to-image alignment. Building on these insights, we introduce Echo-4o-Image, a 180K-scale synthetic dataset generated by GPT-4o, harnessing the power of synthetic image data to address blind spots in real-world coverage. Using this dataset, we fine-tune the unified multimodal generation baseline Bagel to obtain Echo-4o. In addition, we propose two new evaluation benchmarks for a more accurate and challenging assessment of image generation capabilities: GenEval++, which increases instruction complexity to mitigate score saturation, and Imagine-Bench, which focuses on evaluating both the understanding and generation of imaginative content. Echo-4o demonstrates strong performance across standard benchmarks. Moreover, applying Echo-4o-Image to other foundation models (e.g., OmniGen2, BLIP3-o) yields consistent performance gains across multiple metrics, highlighting the dataset's strong transferability.

Junyan Ye, Dongzhi Jiang, Zihao Wang, Leqi Zhu, Zhenghao Hu, Zilong Huang, Jun He, Zhiyuan Yan, Jinghua Yu, Hongsheng Li, Conghui He, Weijia Li • 2025

Related benchmarks

Task	Dataset	Result
Image Generation	Mind-Bench	Knowledge (WK): 0.00e+0	80
Single-image editing	GEdit EN (full)	BG Change: 7.13	42
Instruction-based Image Editing	KRIS Bench 38 (test)	Factual Score: 66.21	27
Instruction-based Image Editing	RISEBench 49 (test)	Reasoning: 35.56	27
Multi-image Reasoning	OmniContext	Single Scene Char Score: 8.62	20
Agentic Image Generation	IA-Bench 1.0 (test)	Checklist Accuracy (Plan): 22.1	18
Anything-to-Image	OmniContext MULTIPLE	Character Fidelity Score: 8.07	12
Anything-to-Image	OmniContext SCENE	Character Fidelity: 8.62	12
Anything-to-Image	OmniContext Overall	Average Score: 8.09	12
Subject-driven image generation	SconeEval	Composition Single COM: 8.58	11

Showing 10 of 21 rows
