UniVerse-1: Unified Audio-Video Generation via Stitching of Experts

About

We introduce UniVerse-1, a unified, Veo-3-like model capable of simultaneously generating coordinated audio and video. To enhance training efficiency, we bypass training from scratch and instead employ a stitching of experts (SoE) technique. This approach deeply fuses the corresponding blocks of pre-trained video and music generation expert models, thereby fully leveraging their foundational capabilities. To ensure accurate annotations and temporal alignment of both ambient sounds and speech with video content, we developed an online annotation pipeline that processes the required training data and generates labels during the training process. This strategy circumvents the performance degradation often caused by misaligned text-based annotations. Through the synergy of these techniques, our model, after being finetuned on approximately 7,600 hours of audio-video data, produces well-coordinated audio and visuals for ambient sound generation and strong alignment for speech generation. To systematically evaluate our proposed method, we introduce Verse-Bench, a new benchmark dataset. In an effort to advance research in audio-video generation and to close the performance gap with state-of-the-art models such as Veo3, we make our model and code publicly available. We hope this contribution will benefit the broader research community. Project page: https://dorniwang.github.io/UniVerse-1/.

Duomin Wang, Wei Zuo, Aojie Li, Ling-Hao Chen, Xinyao Liao, Deyu Zhou, Zixin Yin, Xili Dai, Daxin Jiang, Gang Yu • 2025

Related benchmarks

Task | Dataset | Result | Rank
Joint audio-video generation | JavisBench 1.0 (test) | AV-IB 0.104 | 18
Text-to-Audio-Video Generation | Verse-Bench | MS 0.2 | 16
Joint text-to-audio-video generation | HDTF and Hallo3 English (test) | FID 36.5 | 12
Joint audio-video generation | JavisBench | Audio-Video Consistency (AV-IB) 10.4 | 12
Multimodal Customization | OC-Bench (test) | Face-Sim 0.642 | 12
Audio-Video Joint Generation | Audio-Video Generation (test) | LSE-C 6.01 | 7
Joint text-to-audio-video generation | DH-FaceVid-1K Chinese (test) | CER 0.715 | 6
Text-to-Audio-Video Generation | VBench 2.0 (test) | VQ Score (Video Quality) -0.2 | 6
Joint audio-video generation | Custom human-centric audio-video real and anime images (test) | PQ 4.56 | 6
Text-and-Image to Audio-Visual Generation | Audio-Visual Generation Benchmark (test) | VA 3.77 | 6
Showing 10 of 16 rows
