ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation
About
Recent advances in multimodal generative models have unlocked photorealistic, instruction-aligned image generation, yet leading systems like GPT-4o-Image remain proprietary and inaccessible. To democratize these capabilities, we present ShareGPT-4o-Image, the first dataset comprising 45K text-to-image and 46K text-and-image-to-image samples, all synthesized with GPT-4o's image generation capabilities in order to distill them. Leveraging this dataset, we develop Janus-4o, a multimodal large language model capable of both text-to-image and text-and-image-to-image generation. Janus-4o not only significantly improves text-to-image generation over its predecessor, Janus-Pro, but also newly supports text-and-image-to-image generation. Notably, it achieves impressive performance in text-and-image-to-image generation from scratch, using only 91K synthetic samples and 6 hours of training on a machine with 8 A800 GPUs. We hope the release of ShareGPT-4o-Image and Janus-4o will foster open research in photorealistic, instruction-aligned image generation.
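The data and compute figures quoted above can be tallied with a few lines of arithmetic. The sketch below is only a sanity check on the abstract's rounded numbers; the constant and function names are hypothetical, and the GPU-hour total assumes all 8 GPUs were in use for the full 6 hours, which the abstract does not state explicitly.

```python
# Rounded figures as quoted in the abstract (assumed to be thousands).
TEXT_TO_IMAGE_SAMPLES = 45_000
TEXT_AND_IMAGE_TO_IMAGE_SAMPLES = 46_000
TRAINING_HOURS = 6
NUM_GPUS = 8  # A800 GPUs on a single machine


def training_budget(t2i: int, ti2i: int, hours: int, gpus: int) -> dict:
    """Return the total sample count and GPU-hours implied by the inputs.

    Assumes every GPU runs for the whole training time (an assumption,
    not something the abstract confirms).
    """
    return {
        "total_samples": t2i + ti2i,
        "gpu_hours": hours * gpus,
    }


budget = training_budget(
    TEXT_TO_IMAGE_SAMPLES,
    TEXT_AND_IMAGE_TO_IMAGE_SAMPLES,
    TRAINING_HOURS,
    NUM_GPUS,
)
print(budget)
```

Under these assumptions the two subsets sum to the 91K samples cited in the abstract, and the training run corresponds to 48 A800 GPU-hours.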
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Image Editing | GEdit-Bench-EN (full) | G-Score (O) | 3.83 | 66 |
| Single-image editing | GEdit EN (full) | BG Change | 7.17 | 42 |
| Instruction-based Image Editing | RISEBench 49 (test) | Reasoning | 36.67 | 27 |
| Instruction-based Image Editing | KRIS Bench 38 (test) | Factual Score | 66.92 | 27 |
| Text-to-Image Generation | NegGenBench (test) | Positive Score | 92.5 | 22 |
| Image Editing | ImgEdit (test) | Add Score | 3.6 | 16 |
| Image Editing | GEdit-Bench-EN Intersection | SC Score | 4.69 | 10 |
| Image Editing | GEdit-Bench EN Full set | BC Score | 4.31 | 4 |