Mora: Enabling Generalist Video Generation via A Multi-Agent Framework
About
Text-to-video generation has made significant strides, but replicating the capabilities of advanced systems like OpenAI's Sora remains challenging due to their closed-source nature. Existing open-source methods struggle to achieve comparable performance, often hindered by ineffective agent collaboration and inadequate training data quality. In this paper, we introduce Mora, a novel multi-agent framework that leverages existing open-source modules to replicate Sora's functionalities. We address these fundamental limitations by proposing three key techniques: (1) multi-agent fine-tuning with a self-modulation factor to enhance inter-agent coordination, (2) a data-free training strategy that uses large models to synthesize training data, and (3) a human-in-the-loop mechanism combined with multimodal large language models for data filtering to ensure high-quality training datasets. Our comprehensive experiments on six video generation tasks demonstrate that Mora achieves performance comparable to Sora on VBench, outperforming existing open-source methods across various tasks. Specifically, in the text-to-video generation task, Mora achieved a Video Quality score of 0.800, surpassing Sora's 0.797 and outperforming all other baseline models across six key metrics. Additionally, in the image-to-video generation task, Mora achieved a perfect Dynamic Degree score of 1.00, demonstrating exceptional capability in enhancing motion realism, and achieved higher Imaging Quality than Sora. These results highlight the potential of collaborative multi-agent systems and human-in-the-loop mechanisms in advancing text-to-video generation. Our code is available at https://github.com/lichao-sun/Mora.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Long Video Generation | VBench | Overall Score | 97.4 | 35 |
| Long Video Generation | StoryMem | StoryMem Score | 99 | 15 |
| Long Video Generation | ViStory Self-Consistency | ViStory-Self Score | 0.855 | 15 |
| Long Video Generation | ViStory (Cross-Frame) | ViStory-Cross | 33.1 | 15 |
| Long Video Generation | MovieBench | MovieBench Score | 26.062 | 15 |
| Long Video Generation | MSVE-Bench | MSVE-Bench (NB-Q) | 27.6 | 15 |