OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
About
We introduce OmniFlow, a novel generative model designed for any-to-any generation tasks such as text-to-image, text-to-audio, and audio-to-image synthesis. OmniFlow extends the rectified flow (RF) framework used in text-to-image models to handle the joint distribution of multiple modalities. It outperforms previous any-to-any models on a wide range of tasks, such as text-to-image and text-to-audio synthesis. Our work makes three key contributions. First, we extend RF to a multi-modal setting and introduce a novel guidance mechanism that lets users flexibly control the alignment between different modalities in the generated outputs. Second, we propose a novel architecture that extends the text-to-image MMDiT architecture of Stable Diffusion 3 and enables audio and text generation. The extended modules can be efficiently pretrained individually and merged with the vanilla text-to-image MMDiT for fine-tuning. Lastly, we conduct a comprehensive study of the design choices of rectified flow transformers for large-scale audio and text generation, providing valuable insights into optimizing performance across diverse modalities. The code will be available at https://github.com/jacklishufan/OmniFlows.
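To make the rectified flow idea concrete, the following is a minimal sketch of how noisy training inputs and velocity targets could be built for several modalities at once. It uses the generic straight-line RF formulation, x_t = (1 - t) * x_0 + t * noise with target noise - x_0. The abstract does not specify OmniFlow's exact parametrization, so the function names, the toy latents, and the choice of an independent timestep per modality are all assumptions made for illustration.

```python
import random


def rf_pair(clean, noise, t):
    """Straight-line rectified flow path for one modality.

    Returns the interpolated sample x_t and the constant velocity
    target (noise - clean) along the path from data (t=0) to noise (t=1).
    """
    x_t = [(1.0 - t) * c + t * n for c, n in zip(clean, noise)]
    velocity = [n - c for c, n in zip(clean, noise)]
    return x_t, velocity


def multimodal_rf_sample(latents, rng):
    """Build RF training pairs for several modalities.

    `latents` maps a modality name to a flat list of floats (a toy
    stand-in for real latents). Assumption: each modality gets its own
    independently drawn timestep; this is illustrative only.
    """
    out = {}
    for name, clean in latents.items():
        t = rng.random()  # uniform in [0, 1)
        noise = [rng.gauss(0.0, 1.0) for _ in clean]
        x_t, velocity = rf_pair(clean, noise, t)
        out[name] = {"t": t, "x_t": x_t, "target": velocity}
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    batch = multimodal_rf_sample(
        {"image": [0.1, 0.2, 0.3, 0.4], "text": [1.0, -1.0], "audio": [0.5, 0.5, 0.5]},
        rng,
    )
    for name, item in batch.items():
        print(name, round(item["t"], 3), len(item["x_t"]))
```

In a sketch like this, a network would be trained to predict each modality's `target` from the noisy inputs and their timesteps; the clean latent can be recovered as `x_t - t * target`, which is what makes the straight-line path convenient.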
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Captioning | MS COCO Karpathy (test) | -- | -- | 682 |
| Text-to-Image Generation | GenEval | GenEval Score | 62 | 277 |
| Audio Captioning | AudioCaps (test) | CIDEr | 48 | 140 |
| Text-to-Audio Generation | AudioCaps (test) | FAD | 1.75 | 138 |
| Text-to-Image Generation | MS-COCO 30K (test) | FID | 13.4 | 41 |
| Text-to-Image Generation | evaluation benchmarks one-to-one | CLIP Score | 31.52 | 6 |
| Text-to-Audio Generation | One-to-one evaluation benchmarks Text-to-Audio | FAD | 4.2 | 6 |
| Text-to-Audio Generation | evaluation benchmarks one-to-one | CLAP Score | 24.23 | 6 |
| Text-to-Image Generation | One-to-one evaluation benchmarks Text-to-Image | FID | 22.97 | 6 |
| Audio-to-Text Generation | one-to-one evaluation benchmarks | CLAP Score | 45.08 | 5 |