HyperCLOVA X 8B Omni

About

In this report, we present HyperCLOVA X 8B Omni, the first any-to-any omnimodal model in the HyperCLOVA X family that supports text, audio, and vision as both inputs and outputs. By consolidating multimodal understanding and generation into a single model rather than separate modality-specific pipelines, HyperCLOVA X 8B Omni serves as an 8B-scale omni-pathfinding point toward practical any-to-any omni assistants. At a high level, the model unifies modalities through a shared next-token prediction interface over an interleaved multimodal sequence, while vision and audio encoders inject continuous embeddings for fine-grained understanding and grounding. Empirical evaluations demonstrate competitive performance against comparably sized models across diverse input-output combinations spanning text, audio, and vision, in both Korean and English. We anticipate that the open-weight release of HyperCLOVA X 8B Omni will support a wide range of research and deployment scenarios.

NAVER Cloud HyperCLOVA X Team • 2026

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Object Hallucination Evaluation	POPE	Accuracy	79.8	2056
Multimodal Understanding	MMBench	Accuracy	65.9	887
Video Understanding	MVBench	Accuracy	49.5	635
Multimodal Understanding	MMMU	Accuracy	31	437
Science Question Answering	ARC Challenge	Accuracy	85.8	354
Temporal Video Understanding	TempCompass	--	--	160
Automatic Speech Recognition	LibriSpeech Other	WER	5.03	140
Automatic Speech Recognition	LibriSpeech Clean	WER	2.28	124
Question Answering	MMLU-Pro	Accuracy	53.79	103
Multimodal Perception	MME Perception	Perception Score	1.31e+3	99

Showing 10 of 39 rows
