Kling-Omni Technical Report
About
We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual-language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike disjointed pipeline approaches, Kling-Omni supports a wide range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation to deliver cinematic-quality, highly intelligent video content creation. To support these capabilities, we constructed a comprehensive data system that serves as the foundation for multimodal video creation. The framework is further empowered by efficient large-scale pre-training strategies and infrastructure optimizations for inference. Comprehensive evaluations show that Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following. Moving beyond a content creation tool, we believe Kling-Omni is a pivotal advancement toward multimodal world simulators capable of perceiving, reasoning about, generating, and interacting with dynamic and complex worlds.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Instruction-only Video Editing (Add) | VIE-Bench | Instruction Following | 9.333 | 15 |
| Video Editing | VIE-Bench | Instruction Following | 9.378 | 11 |
| Video Editing | VIE-Bench (Swap Change) | Instruction Following Score | 9.495 | 10 |
| Video Editing | VIE-Bench (Style Tone Change) | Instruction Following Score | 9.867 | 7 |
| Subject-driven Video Generation | Subject-to-Video (S2V) (test) | MS | 0.4965 | 5 |
| Video Editing | RefVIE-Bench | Identity Consistency | 4.75 | 4 |