Kosmos-G: Generating Images in Context with Multimodal Large Language Models

About

Recent advancements in subject-driven image generation have made significant strides. However, current methods still fall short in diverse application scenarios, as they require test-time tuning and cannot accept interleaved multi-image and text input. These limitations keep them far from the ultimate goal of "image as a foreign language in image generation." This paper presents Kosmos-G, a model that leverages the advanced multimodal perception capabilities of Multimodal Large Language Models (MLLMs) to tackle the aforementioned challenge. Our approach aligns the output space of MLLM with CLIP using the textual modality as an anchor and performs compositional instruction tuning on curated data. Kosmos-G demonstrates an impressive capability of zero-shot subject-driven generation with interleaved multi-image and text input. Notably, the score distillation instruction tuning requires no modifications to the image decoder. This allows for a seamless substitution of CLIP and effortless integration with a myriad of U-Net techniques ranging from fine-grained controls to personalized image decoder variants. We posit Kosmos-G as an initial attempt towards the goal of "image as a foreign language in image generation." The code can be found at https://aka.ms/Kosmos-G

Xichen Pan, Li Dong, Shaohan Huang, Zhiliang Peng, Wenhu Chen, Furu Wei • 2023
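
The alignment step described in the abstract (matching the MLLM output space to CLIP with text as the anchor) can be illustrated with a toy sketch. The abstract does not specify the aligner's architecture or loss, so everything here is an assumption for illustration: the linear map, the mean-squared-error objective, the embedding sizes, and the synthetic arrays standing in for MLLM and CLIP text embeddings of the same captions. It is not the authors' implementation.

```python
import numpy as np


def fit_linear_aligner(source, target, lr=0.1, steps=500):
    """Fit W by gradient descent so that source @ W approximates target.

    Hypothetical helper: a plain linear map stands in for whatever
    aligner the real system uses.
    """
    n, d_src = source.shape
    d_tgt = target.shape[1]
    W = np.zeros((d_src, d_tgt))
    for _ in range(steps):
        residual = source @ W - target        # shape (n, d_tgt)
        # Gradient of (1/n) * sum_i ||x_i W - y_i||^2 with respect to W.
        grad = 2.0 * source.T @ residual / n
        W -= lr * grad
    return W


def mse(a, b):
    """Mean squared error over all entries."""
    return float(np.mean((a - b) ** 2))


rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions): "MLLM" and "CLIP" embeddings of the
# same text captions, related here by an unknown linear map.
mllm_text_emb = rng.normal(size=(256, 16))
true_map = rng.normal(size=(16, 8))
clip_text_emb = mllm_text_emb @ true_map

W = fit_linear_aligner(mllm_text_emb, clip_text_emb)

loss_before = mse(np.zeros_like(clip_text_emb), clip_text_emb)
loss_after = mse(mllm_text_emb @ W, clip_text_emb)
print(f"alignment loss before: {loss_before:.4f}, after: {loss_after:.2e}")
```

Because only text pairs are used to fit the map, an image decoder that already consumes CLIP-space conditioning would not need to change, which is the property the abstract highlights.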

Related benchmarks

Task                                           Dataset            Result              Rank
OCR Evaluation                                 OCRBench           Score 109           296
Visual Question Answering                      ScienceQA          Accuracy 29.6       210
Visual Understanding                           MM-Vet             MM-Vet Score 11.3   102
Text-to-Image Generation                       MS-COCO            FID 10.99           75
Subject-driven image generation                DreamBench         DINO Score 69.4     62
Hallucination and Visual Reasoning Evaluation  HallusionBench     Score 20.4          37
Visual Understanding                           MME                MME Score 104.3     37
Vision Understanding                           MMMU               Overall Score 14.8  28
Subject-driven generation                      DreamBench (test)  DINO Score 0.694    25
Multi-modal Visual Capability                  MMStar             Score 18.4          20
Showing 10 of 16 rows
