UniVerse-1: Unified Audio-Video Generation via Stitching of Experts

About

We introduce UniVerse-1, a unified, Veo-3-like model capable of simultaneously generating coordinated audio and video. To improve training efficiency, we avoid training from scratch and instead employ a stitching of experts (SoE) technique. This approach deeply fuses the corresponding blocks of pre-trained video and music generation expert models, thereby fully leveraging their foundational capabilities. To ensure accurate annotations and temporal alignment of both ambient sounds and speech with video content, we developed an online annotation pipeline that processes the required training data and generates labels during the training process. This strategy circumvents the performance degradation often caused by misaligned text-based annotations. Through the synergy of these techniques, our model, after being finetuned on approximately 7,600 hours of audio-video data, produces well-coordinated audio-visual results for ambient sound generation and strong alignment for speech generation. To systematically evaluate our proposed method, we introduce Verse-Bench, a new benchmark dataset. In an effort to advance research in audio-video generation and to close the performance gap with state-of-the-art models such as Veo-3, we make our model and code publicly available. We hope this contribution will benefit the broader research community. Project page: https://dorniwang.github.io/UniVerse-1/.

Duomin Wang, Wei Zuo, Aojie Li, Ling-Hao Chen, Xinyao Liao, Deyu Zhou, Zixin Yin, Xili Dai, Daxin Jiang, Gang Yu • 2025

Related benchmarks

Task	Dataset	Result
Joint audio-video generation	JavisBench 1.0 (test)	AV-IB 0.104	18
Text-to-Audio-Video Generation	Verse-Bench	MS 0.2	16
Joint text-to-audio-video generation	HDTF and Hallo3 English (test)	FID 36.5	12
Joint audio-video generation	JavisBench	Audio-Video Consistency (AV-IB) 10.4	12
Multimodal Customization	OC-Bench (test)	Face-Sim 0.642	12
Multi-shot Audio-Visual Generation	MAVINSet high-fidelity benchmark 1K-sample (test)	FVD 356.9	11
Multi-shot Audio-Visual Generation	MAVINSet subjective 20 samples	AVQ 4.6	11
Audio-Video Joint Generation	Audio-Video Generation (test)	LSE-C 6.01	7
Joint text-to-audio-video generation	DH-FaceVid-1K Chinese (test)	CER 0.715	6
Text-to-Audio-Video Generation	VBench 2.0 (test)	VQ Score (Video Quality) -0.2	6

Showing 10 of 18 rows
