UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation

About

Recent video generation models demonstrate impressive synthesis capabilities but remain limited by single-modality conditioning, constraining their holistic world understanding. This stems from insufficient cross-modal interaction and limited modal diversity for comprehensive world knowledge representation. To address these limitations, we introduce UnityVideo, a unified framework for world-aware video generation that jointly learns across multiple modalities (segmentation masks, human skeletons, DensePose, optical flow, and depth maps) and training paradigms. Our approach features two core components: (1) dynamic noising to unify heterogeneous training paradigms, and (2) a modality switcher with an in-context learner that enables unified processing via modular parameters and contextual learning. We contribute a large-scale unified dataset with 1.3M samples. Through joint optimization, UnityVideo accelerates convergence and significantly enhances zero-shot generalization to unseen data. We demonstrate that UnityVideo achieves superior video quality, consistency, and improved alignment with physical world constraints. Code and data can be found at: https://github.com/dvlab-research/UnityVideo

Jiehui Huang, Yuechen Zhang, Xu He, Yuan Gao, Zhi Cen, Bin Xia, Yan Zhou, Xin Tao, Pengfei Wan, Jiaya Jia • 2025

Related benchmarks

Task                          | Dataset                                         | Result                       | Rank
Text-to-Video Generation      | VBench & UniBench Dataset                       | Background Consistency 97.44 | 6
Depth Estimation              | UniBench                                        | Abs Rel 0.022                | 4
Video Generation              | WISA-80K subset of 12 randomly selected videos  | Physical Quality 38.5        | 4
Controllable Video Generation | VBench & UniBench                               | Background Consistency 96.04 | 3
Video Segmentation            | UniBench Dataset                                | mIoU 68.82                   | 3
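
For readers unfamiliar with the two reconstruction metrics listed above, the following is a minimal sketch of their standard textbook definitions: Abs Rel is the mean of |predicted depth - true depth| / true depth, and mIoU is the per-class intersection-over-union averaged over classes. The function names `abs_rel` and `mean_iou`, the flat-list inputs, and the conventions of skipping non-positive ground-truth depths and classes absent from both prediction and ground truth are assumptions made for illustration. The exact evaluation protocol used on UniBench is not stated on this page and may differ.

```python
def abs_rel(pred, gt):
    """Mean absolute relative depth error.

    Assumption: pixels whose ground-truth depth is not positive are
    treated as invalid and skipped.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union across classes, as a fraction in [0, 1].

    Assumption: classes that appear in neither the prediction nor the
    ground truth are left out of the average.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union > 0:
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class present in prediction or ground truth")
    return sum(ious) / len(ious)


if __name__ == "__main__":
    # Toy inputs, not data from the paper.
    print(abs_rel([1.0, 2.2], [1.0, 2.0]))
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

Lower Abs Rel is better and higher mIoU is better. The mIoU of 68.82 in the table is presumably on a percentage scale, whereas `mean_iou` here returns a fraction.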
