
HyperCLOVA X 8B Omni

About

In this report, we present HyperCLOVA X 8B Omni, the first any-to-any omnimodal model in the HyperCLOVA X family that supports text, audio, and vision as both inputs and outputs. By consolidating multimodal understanding and generation into a single model rather than separate modality-specific pipelines, HyperCLOVA X 8B Omni serves as an 8B-scale omni-pathfinding point toward practical any-to-any omni assistants. At a high level, the model unifies modalities through a shared next-token prediction interface over an interleaved multimodal sequence, while vision and audio encoders inject continuous embeddings for fine-grained understanding and grounding. Empirical evaluations demonstrate competitive performance against comparably sized models across diverse input-output combinations spanning text, audio, and vision, in both Korean and English. We anticipate that the open-weight release of HyperCLOVA X 8B Omni will support a wide range of research and deployment scenarios.
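The "interleaved multimodal sequence" idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the embedding width, and the segment format below are all hypothetical, chosen only to show discrete text tokens and continuous encoder embeddings sharing one sequence, with next-token targets formed by shifting the discrete tokens by one position.

```python
import random

EMBED_DIM = 4  # toy width (assumption); the real model width is not given here


def embed_token(token_id, dim=EMBED_DIM):
    """Toy deterministic lookup standing in for a learned embedding table."""
    rng = random.Random(token_id)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def build_interleaved_sequence(segments, dim=EMBED_DIM):
    """Flatten mixed-modality segments into one list of (modality, vector) pairs.

    Each segment is either
      ("text", [token ids])             discrete tokens, embedded by lookup
      ("image" or "audio", [[floats]])  continuous encoder outputs, used as-is
    """
    sequence = []
    for modality, payload in segments:
        if modality == "text":
            for token_id in payload:
                sequence.append((modality, embed_token(token_id, dim)))
        else:
            for vector in payload:
                if len(vector) != dim:
                    raise ValueError("encoder output must match the embedding width")
                sequence.append((modality, list(vector)))
    return sequence


def next_token_targets(token_ids):
    """Return (context, target) pairs: each token is predicted from its prefix."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]


if __name__ == "__main__":
    segments = [
        ("text", [101, 102]),                 # hypothetical prompt token ids
        ("image", [[0.0] * EMBED_DIM] * 3),   # hypothetical vision encoder outputs
        ("text", [103]),
    ]
    seq = build_interleaved_sequence(segments)
    print([modality for modality, _ in seq])
    print(next_token_targets([101, 102, 103]))
```

In this sketch the sequence model would see a single ordered list of vectors regardless of modality, which is the property the abstract attributes to the shared next-token prediction interface.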

NAVER Cloud HyperCLOVA X Team • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Object Hallucination Evaluation | POPE | Accuracy 79.8 | 1455 |
| Multimodal Understanding | MMBench | Accuracy 65.9 | 637 |
| Multimodal Understanding | MMMU | Accuracy 31 | 437 |
| Video Understanding | MVBench | Accuracy 49.5 | 425 |
| Science Question Answering | ARC Challenge | Accuracy 85.8 | 342 |
| Multimodal Perception | MME Perception | Perception Score 1.31e+3 | 79 |
| Temporal Video Understanding | TempCompass | -- | 68 |
| Text-to-Speech | LibriSpeech clean (test) | WER 7.9 | 66 |
| Image Editing | ImgEdit 1.0 (test) | -- | 27 |
| General Knowledge and Reasoning | MMLU | Accuracy 75.7 | 24 |
Showing 10 of 17 rows
