VideoPoet: A Large Language Model for Zero-Shot Video Generation
About
We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Video Generation | MSR-VTT (test) | CLIP Similarity | 0.3123 | 85 |
| Text-to-Video Generation | UCF-101 | FVD | 355 | 61 |
| Video Generation | Physics-IQ | Phys. IQ Score | 29.5 | 45 |
| Text-to-Video Generation | MSR-VTT | CLIPSIM | 0.3049 | 28 |
| Physical Plausibility Evaluation | Physics-IQ (modified) | Solid Mechanics Score | 35.1 | 6 |
| Video Stylization | DAVIS 2016 (val) | CLIPSIM | 0.3417 | 2 |