HyperCLOVA X 8B Omni
About
In this report, we present HyperCLOVA X 8B Omni, the first any-to-any omnimodal model in the HyperCLOVA X family that supports text, audio, and vision as both inputs and outputs. By consolidating multimodal understanding and generation into a single model rather than separate modality-specific pipelines, HyperCLOVA X 8B Omni serves as an 8B-scale pathfinding step toward practical any-to-any omni assistants. At a high level, the model unifies modalities through a shared next-token prediction interface over an interleaved multimodal sequence, while vision and audio encoders inject continuous embeddings for fine-grained understanding and grounding. Empirical evaluations demonstrate competitive performance against comparably sized models across diverse input-output combinations spanning text, audio, and vision, in both Korean and English. We anticipate that the open-weight release of HyperCLOVA X 8B Omni will support a wide range of research and deployment scenarios.
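The interleaved-sequence idea described above can be illustrated with a toy sketch. This is not the model's implementation: the segment classes, `toy_encoder`, `D_MODEL`, and the rule that only discrete-token positions serve as prediction targets are all assumptions made for illustration. The sketch only shows how discrete tokens and encoder-produced continuous embeddings might share one sequence that a next-token objective runs over.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

D_MODEL = 8  # toy embedding width (hypothetical, not the model's real size)


@dataclass
class TokenSegment:
    """Discrete tokens of any modality (hypothetical container)."""
    token_ids: List[int]


@dataclass
class ContinuousSegment:
    """Raw vision or audio features to be encoded (hypothetical container)."""
    modality: str
    features: List[List[float]]


Segment = Union[TokenSegment, ContinuousSegment]


def make_embedding_table(vocab_size: int, seed: int = 0) -> List[List[float]]:
    """Random lookup table standing in for a learned token embedding matrix."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(D_MODEL)]
            for _ in range(vocab_size)]


def toy_encoder(features: List[List[float]]) -> List[List[float]]:
    """Stand-in for a vision/audio encoder: pads or truncates to D_MODEL."""
    return [(list(f) + [0.0] * D_MODEL)[:D_MODEL] for f in features]


def build_sequence(
    segments: List[Segment], table: List[List[float]]
) -> Tuple[List[List[float]], List[Optional[int]]]:
    """Flatten segments into one embedding sequence plus per-position targets.

    Targets hold the token id at discrete positions and None at positions
    filled by continuous encoder embeddings.
    """
    embeddings: List[List[float]] = []
    targets: List[Optional[int]] = []
    for seg in segments:
        if isinstance(seg, TokenSegment):
            for tok in seg.token_ids:
                embeddings.append(table[tok])
                targets.append(tok)
        else:
            for emb in toy_encoder(seg.features):
                embeddings.append(emb)
                targets.append(None)
    return embeddings, targets


def next_token_pairs(targets: List[Optional[int]]) -> List[Tuple[int, int]]:
    """(position, next token id) pairs where the next position is discrete."""
    return [(i, targets[i + 1])
            for i in range(len(targets) - 1)
            if targets[i + 1] is not None]
```

For instance, a sequence made of two tokens, three image-patch feature vectors, and one further token yields six positions, of which only those followed by a discrete token contribute a prediction target in this sketch.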
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 79.8 | 1455 |
| Multimodal Understanding | MMBench | Accuracy | 65.9 | 637 |
| Multimodal Understanding | MMMU | Accuracy | 31 | 437 |
| Video Understanding | MVBench | Accuracy | 49.5 | 425 |
| Science Question Answering | ARC Challenge | Accuracy | 85.8 | 342 |
| Multimodal Perception | MME Perception | Perception Score | 1.31e+3 | 79 |
| Temporal Video Understanding | TempCompass | -- | -- | 68 |
| Text-to-Speech | LibriSpeech clean (test) | WER | 7.9 | 66 |
| Image Editing | ImgEdit 1.0 (test) | -- | -- | 27 |
| General Knowledge and Reasoning | MMLU | Accuracy | 75.7 | 24 |
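
For readers unfamiliar with the WER metric in the Text-to-Speech row, the following is a minimal sketch of the standard word error rate definition: word-level edit distance divided by the number of reference words. It is not the evaluation code behind the number in the table; the function name is hypothetical, and text normalization and the choice of speech recognizer used to transcribe synthesized audio are left out. WER is conventionally reported as a percentage, and reading the table's 7.9 that way is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Multiplying the returned fraction by 100 gives the percentage form in which WER is usually quoted.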