
Omni-Judge: Can Omni-LLMs Serve as Human-Aligned Judges for Text-Conditioned Audio-Video Generation?

About

State-of-the-art text-to-video generation models such as Sora 2 and Veo 3 can now produce high-fidelity videos with synchronized audio directly from a textual prompt, marking a new milestone in multi-modal generation. However, evaluating such tri-modal outputs remains an unsolved challenge. Human evaluation is reliable but costly and difficult to scale, while traditional automatic metrics, such as FVD, CLAP, and ViCLIP, focus on isolated modality pairs, struggle with complex prompts, and provide limited interpretability. Omni-modal large language models (omni-LLMs) present a promising alternative: they naturally process audio, video, and text, support rich reasoning, and offer interpretable chain-of-thought feedback. Driven by this, we introduce Omni-Judge, a study assessing whether omni-LLMs can serve as human-aligned judges for text-conditioned audio-video generation. Across nine perceptual and alignment metrics, Omni-Judge achieves correlation comparable to traditional metrics and excels on semantically demanding tasks such as audio-text alignment, video-text alignment, and audio-video-text coherence. It underperforms on high-FPS perceptual metrics, including video quality and audio-video synchronization, due to limited temporal resolution. Omni-Judge provides interpretable explanations that expose semantic or physical inconsistencies, enabling practical downstream uses such as feedback-based refinement. Our findings highlight both the potential and current limitations of omni-LLMs as unified evaluators for multi-modal generation.

Susan Liang, Chao Huang, Filippos Bellos, Yolo Yunlong Tang, Qianxiang Shen, Jing Bi, Luchuan Song, Zeliang Zhang, Jason Corso, Chenliang Xu • 2026
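
The abstract measures human alignment as the correlation between judge scores and human ratings. As a rough illustration only, the sketch below computes a Spearman rank correlation in plain Python. The choice of Spearman is an assumption (the abstract does not name the coefficient used), and the function names and the toy scores are hypothetical, not taken from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def spearman(judge_scores, human_ratings):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(judge_scores) != len(human_ratings) or len(judge_scores) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(judge_scores), average_ranks(human_ratings))


if __name__ == "__main__":
    # Hypothetical scores for five generated clips (illustrative only).
    judge_scores = [4, 2, 5, 3, 1]
    human_ratings = [3.8, 2.1, 4.9, 1.5, 1.9]
    print(f"Spearman correlation: {spearman(judge_scores, human_ratings):.3f}")
```

A rank-based coefficient is a common choice when judge and human scores use different scales, since only the ordering of the clips matters.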

Related benchmarks

Task                    Dataset          Result                           Rank
Audio Quality           Sora 2           Quality Score 3.937              10
Video Quality           Sora 2           Score 4.627                      9
Video Quality           Veo 3            Score 4.643                      9
Audio Aesthetic         Veo 3            Aesthetic Score 2.87             6
Video-Text Alignment    Sora 2           Overall Alignment Score 4.817    4
Video-Text Alignment    Veo 3            Score 4.72                       4
Audio-Text Alignment    Sora 2           Overall Alignment Score 2.517    4
Audio-Text Alignment    Veo 3            Alignment Score 2.183            4
Audio Quality           Veo 3            Audio Quality Score 3.82         4
Audio-Video Alignment   Sora 2 (test)    Overall Score 4.18               3
Showing 10 of 15 rows
