OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding
About
In this paper, we propose OmniVDiff, a novel framework for controllable video diffusion that aims to synthesize and comprehend multiple video visual modalities within a single diffusion model. To achieve this, OmniVDiff treats all video visual modalities in the color space to learn a joint distribution, while employing an adaptive control strategy that dynamically adjusts the role of each visual modality during the diffusion process, either as a generation modality or a conditioning modality. Our framework supports three key capabilities: (1) Text-conditioned video generation, where all modalities are jointly synthesized from a textual prompt; (2) Video understanding, where structural modalities are predicted from RGB inputs in a coherent manner; and (3) X-conditioned video generation, where video synthesis is guided by fine-grained inputs such as depth, Canny edges, and segmentation. Extensive experiments demonstrate that OmniVDiff achieves state-of-the-art performance on video generation tasks and competitive results on video understanding. Its flexibility and scalability make it well suited for downstream applications such as video-to-video translation, modality adaptation for visual tasks, and scene reconstruction.
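To make the adaptive control strategy concrete, the sketch below shows one way a denoising step could switch each modality between a generation role and a conditioning role: conditioning modalities are passed through unchanged at every step, while generated modalities are updated by the shared denoiser. This is a minimal illustration under assumed names (`diffusion_step`, `roles`, the latent shapes), not the paper's actual implementation or API.

```python
# Hypothetical sketch of the adaptive modality-control idea: each video
# modality (RGB, depth, Canny edges, segmentation) is either denoised jointly
# with the others (generation role) or injected clean at every step
# (conditioning role). All names and shapes are illustrative assumptions.
import torch

MODALITIES = ["rgb", "depth", "canny", "segmentation"]

def diffusion_step(model, latents, timestep, roles, text_emb):
    """One denoising step over the concatenated multi-modality latent.

    latents: dict mapping modality name -> latent tensor [B, C, T, H, W]
    roles:   dict mapping modality name -> "generate" or "condition"
    """
    # Stack all modality latents along the channel dimension so the
    # denoiser models their joint distribution.
    stacked = torch.cat([latents[m] for m in MODALITIES], dim=1)

    # Role flags tell the denoiser which channels are clean conditions
    # and which carry noise to be removed.
    role_flags = torch.tensor(
        [1.0 if roles[m] == "condition" else 0.0 for m in MODALITIES]
    )

    pred = model(stacked, timestep, text_emb, role_flags)

    # Only the latents of generated modalities are updated; conditioning
    # modalities keep their original (clean) latents.
    out = {}
    for m, chunk in zip(MODALITIES, pred.chunk(len(MODALITIES), dim=1)):
        out[m] = latents[m] if roles[m] == "condition" else chunk
    return out
```

Under this framing, the three capabilities correspond to different role assignments: text-conditioned generation sets every modality to "generate", video understanding conditions on RGB and generates the structural modalities, and X-conditioned generation conditions on depth, Canny edges, or segmentation while generating RGB.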
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text-to-Video Generation | VideoPhy | -- | 41 |
| Physical Plausibility Evaluation | VideoPhy | Average PC11 | 16 |
| Text-to-Video Generation | PhyGenBench | Mec Score5 | 12 |