Kling-Omni Technical Report
About
We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike disjointed pipeline approaches, Kling-Omni supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation to deliver cinematic-quality and highly intelligent video content creation. To support these capabilities, we constructed a comprehensive data system that serves as the foundation for multimodal video creation. The framework is further empowered by efficient large-scale pre-training strategies and infrastructure optimizations for inference. Comprehensive evaluations reveal that Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following. Moving beyond a content creation tool, we believe Kling-Omni is a pivotal advancement toward multimodal world simulators capable of perceiving, reasoning about, generating, and interacting with dynamic and complex worlds.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Editing | VIE-Bench | Instruction Following | 9.378 | 18 |
| Instruction-only Video Editing (Add) | VIE-Bench | Instruction Following | 9.333 | 15 |
| Video Editing | VIE-Bench Swap Change | Instruction Following Score | 9.495 | 10 |
| Controllable Video Generation | CogControlBench | AQ | 57.1 | 9 |
| Video Editing | VIE-Bench Style Tone Change | Instruction Following Score | 9.867 | 7 |
| Video Stylization | Public Bench | CSD | 0.416 | 7 |
| Video Stylization | VISTA Bench | CSD | 0.448 | 7 |
| Unseen Robot Adaptation | Synthetic Held-out Embodiment Benchmark | PSNR | 22.7 | 6 |
| Text-Guided Video Effect Generation | VideoEffect 130K (test) | V-Consistency Score | 8.61 | 5 |
| Unseen Robot Adaptation | Real-world benchmark | Motion Consistency | 7.49 | 5 |