OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows

About

We introduce OmniFlow, a novel generative model designed for any-to-any generation tasks such as text-to-image, text-to-audio, and audio-to-image synthesis. OmniFlow advances the rectified flow (RF) framework used in text-to-image models to handle the joint distribution of multiple modalities. It outperforms previous any-to-any models on a wide range of tasks, such as text-to-image and text-to-audio synthesis. Our work offers three key contributions. First, we extend RF to a multi-modal setting and introduce a novel guidance mechanism, enabling users to flexibly control the alignment between different modalities in the generated outputs. Second, we propose a novel architecture that extends the text-to-image MMDiT architecture of Stable Diffusion 3 and enables audio and text generation. The extended modules can be efficiently pretrained individually and merged with the vanilla text-to-image MMDiT for fine-tuning. Lastly, we conduct a comprehensive study on the design choices of rectified flow transformers for large-scale audio and text generation, providing valuable insights into optimizing performance across diverse modalities. The code will be available at https://github.com/jacklishufan/OmniFlows.
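As a rough illustration of the rectified flow formulation the abstract builds on, the sketch below uses the standard linear path between a clean latent and Gaussian noise, with the velocity along that path as the regression target. Sampling an independent timestep for each modality is an assumption made here for illustration, as are the function names, the latent shapes, and the modality keys; none of it is taken from the OmniFlow code.

```python
import numpy as np

def rf_interpolate(x0, noise, t):
    """Linear rectified-flow path: x_t = (1 - t) * x0 + t * noise."""
    return (1.0 - t) * x0 + t * noise

def rf_velocity_target(x0, noise):
    """Velocity along the straight path: d x_t / d t = noise - x0."""
    return noise - x0

def sample_multimodal_batch(latents, rng):
    """Noise each modality with its own timestep (an assumption of this sketch)."""
    out = {}
    for name, x0 in latents.items():
        t = rng.uniform(0.0, 1.0)              # hypothetical per-modality timestep
        noise = rng.standard_normal(x0.shape)  # Gaussian endpoint of the path
        out[name] = {
            "t": t,
            "x_t": rf_interpolate(x0, noise, t),
            "target": rf_velocity_target(x0, noise),
        }
    return out

rng = np.random.default_rng(0)
# Hypothetical latent shapes; real encoders would produce these.
latents = {
    "image": rng.standard_normal((4, 8)),
    "audio": rng.standard_normal((4, 6)),
    "text": rng.standard_normal((4, 5)),
}
batch = sample_multimodal_batch(latents, rng)
```

A model trained on such a batch would regress each modality's `target` from the noised inputs and their timesteps. Because the path is linear, subtracting `t * target` from `x_t` recovers the clean latent exactly, which is a convenient sanity check.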

Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Zichun Liao, Yusuke Kato, Kazuki Kozuka, Aditya Grover • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Image Captioning | MS COCO Karpathy (test) | -- | -- | 682 |
| Text-to-Image Generation | GenEval | GenEval Score | 62 | 277 |
| Audio Captioning | AudioCaps (test) | CIDEr | 48 | 140 |
| Text-to-Audio Generation | AudioCaps (test) | FAD | 1.75 | 138 |
| Text-to-Image Generation | MS-COCO 30K (test) | FID | 13.4 | 41 |
| Text-to-Image Generation | evaluation benchmarks one-to-one | CLIP Score | 31.52 | 6 |
| Text-to-Audio Generation | One-to-one evaluation benchmarks Text-to-Audio | FAD | 4.2 | 6 |
| Text-to-Audio Generation | evaluation benchmarks one-to-one | CLAP Score | 24.23 | 6 |
| Text-to-Image Generation | One-to-one evaluation benchmarks Text-to-Image | FID | 22.97 | 6 |
| Audio-to-Text Generation | one-to-one evaluation benchmarks | CLAP Score | 45.08 | 5 |
Showing 10 of 24 rows
