Omni-Judge: Can Omni-LLMs Serve as Human-Aligned Judges for Text-Conditioned Audio-Video Generation?

About

State-of-the-art text-to-video generation models such as Sora 2 and Veo 3 can now produce high-fidelity videos with synchronized audio directly from a textual prompt, marking a new milestone in multi-modal generation. However, evaluating such tri-modal outputs remains an unsolved challenge. Human evaluation is reliable but costly and difficult to scale, while traditional automatic metrics, such as FVD, CLAP, and ViCLIP, focus on isolated modality pairs, struggle with complex prompts, and provide limited interpretability. Omni-modal large language models (omni-LLMs) present a promising alternative: they naturally process audio, video, and text, support rich reasoning, and offer interpretable chain-of-thought feedback. Driven by this, we introduce Omni-Judge, a study assessing whether omni-LLMs can serve as human-aligned judges for text-conditioned audio-video generation. Across nine perceptual and alignment metrics, Omni-Judge achieves correlation comparable to traditional metrics and excels on semantically demanding tasks such as audio-text alignment, video-text alignment, and audio-video-text coherence. It underperforms on high-FPS perceptual metrics, including video quality and audio-video synchronization, due to limited temporal resolution. Omni-Judge provides interpretable explanations that expose semantic or physical inconsistencies, enabling practical downstream uses such as feedback-based refinement. Our findings highlight both the potential and current limitations of omni-LLMs as unified evaluators for multi-modal generation.
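To make "human-aligned" concrete, here is a minimal sketch of how agreement between a judge's scores and human ratings could be measured with a rank correlation. The abstract does not say which correlation coefficient the study uses, so Spearman's rho is an assumption, and the function names and the sample scores are hypothetical illustrations, not data from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of items tied with the item at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-clip scores for one metric (NOT values from the paper).
judge_scores = [4.5, 3.0, 2.0, 4.0, 1.5]
human_ratings = [5, 3, 2, 4, 2]
rho = spearman(judge_scores, human_ratings)
print(f"Spearman rho between judge and human ratings: {rho:.3f}")
```

A rank-based measure is a natural fit here because judge and human scores may sit on different scales; only the relative ordering of the clips affects the result.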

Susan Liang, Chao Huang, Filippos Bellos, Yolo Yunlong Tang, Qianxiang Shen, Jing Bi, Luchuan Song, Zeliang Zhang, Jason Corso, Chenliang Xu • 2026

Related benchmarks

Task	Dataset	Result
Audio Quality	Sora 2	Quality Score 3.937	10
Video Quality	Sora 2	Score 4.627	9
Video Quality	Veo 3	Score 4.643	9
Audio Aesthetic	Veo 3	Aesthetic Score 2.87	6
Video-Text Alignment	Sora 2	Overall Alignment Score 4.817	4
Video-Text Alignment	Veo 3	Score 4.72	4
Audio-Text Alignment	Sora 2	Overall Alignment Score 2.517	4
Audio-Text Alignment	Veo 3	Alignment Score 2.183	4
Audio Quality	Veo 3	Audio Quality Score 3.82	4
Audio-Video Alignment	Sora 2 (test)	Overall Score 4.18	3

Showing 10 of 15 rows
