Long-CODE: Isolating Pure Long-Context as an Orthogonal Dimension in Video Evaluation

About

As video generation models achieve unprecedented capabilities, the demand for robust video evaluation metrics becomes increasingly critical. Traditional metrics are tailored to short-video evaluation, predominantly assessing frame-level visual quality and localized temporal smoothness. However, as state-of-the-art video generation models scale to longer videos, these metrics fail to capture essential long-range characteristics, such as narrative richness and global causal consistency. Recognizing that short-term visual perception and long-context attributes are fundamentally orthogonal dimensions, we argue that long-video metrics should be disentangled from short-video assessments. In this paper, we focus on the rigorous justification and design of a dedicated framework for long-video evaluation. We first introduce a suite of long-video attribute corruption tests, which expose a critical limitation of existing short-video metrics: their insensitivity to structural inconsistencies, such as shot-level perturbations and narrative shuffling. To bridge this gap, we design a novel long-video metric based on shot dynamics, which is highly sensitive to the corruptions in this long-range testing framework. Furthermore, we introduce Long-CODE (Long-Context as an Orthogonal Dimension for video Evaluation), a specialized dataset designed to benchmark long-video evaluation, with human annotations restricted to genuine long-range characteristics. Extensive experiments show that our proposed metric achieves state-of-the-art correlation with human judgments. Ultimately, our metric and benchmark complement existing short-video standards, establishing a holistic and unbiased evaluation paradigm for video generation models.
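The idea behind a narrative-shuffling corruption test can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: a "shot" is reduced to a list of per-frame scalar features, and `frame_level_metric` and `order_aware_metric` are hypothetical toy stand-ins, not the paper's shot-dynamics metric or any existing short-video metric. The sketch only shows why a score that pools frames independently cannot react to shot reordering, while a score that looks at shot-to-shot transitions can.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical representation: one shot = a list of per-frame scalar features.
Shot = List[float]


def shuffle_shots(shots: Sequence[Shot], seed: int = 0) -> List[Shot]:
    """Corruption: permute the shot order, leaving each shot's frames intact."""
    rng = random.Random(seed)
    shuffled = list(shots)  # copy so the input sequence is not modified
    rng.shuffle(shuffled)
    return shuffled


def frame_level_metric(shots: Sequence[Shot]) -> float:
    """Toy short-video style score: mean over all frames, ignoring order."""
    frames = [f for shot in shots for f in shot]
    return sum(frames) / len(frames)


def order_aware_metric(shots: Sequence[Shot]) -> float:
    """Toy order-dependent score: mean absolute change between
    the feature means of consecutive shots."""
    means = [sum(s) / len(s) for s in shots]
    jumps = [abs(b - a) for a, b in zip(means, means[1:])]
    return sum(jumps) / len(jumps)


def sensitivity(metric: Callable[[Sequence[Shot]], float],
                shots: Sequence[Shot], seed: int = 0) -> float:
    """Absolute score change caused by the shuffling corruption."""
    return abs(metric(shots) - metric(shuffle_shots(shots, seed)))


if __name__ == "__main__":
    # Six shots whose features drift steadily, a stand-in for a coherent narrative.
    video = [[float(i)] * 4 for i in range(6)]
    print("frame-level sensitivity:", sensitivity(frame_level_metric, video, seed=1))
    print("order-aware sensitivity:", sensitivity(order_aware_metric, video, seed=1))
```

In this toy setting the frame-level score is identical before and after shuffling by construction, which mirrors the insensitivity described above; a real test would substitute actual video features and the metrics under study.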

Zhijiang Tang, Jiaxin Qi, Bing Zhao, Jianqiang Huang • 2026

Related benchmarks

Task	Dataset	Metric	Value	(unlabeled)
Long Video Quality Evaluation	VGoT	Spearman Correlation	0.708	12
Long Video Quality Evaluation	HoloCine	Spearman Correlation	0.571	12
Long Video Quality Evaluation	Wan 2.2	Spearman Correlation	0.835	12
Long Video Quality Evaluation	StoryMem	Spearman Correlation	0.831	12
Long Video Quality Evaluation	Veo 3.1	Spearman Correlation	0.796	12
Long Video Quality Evaluation	Sora 2	Spearman Correlation	0.733	12
Long Video Quality Evaluation	Overall Aggregated Models	Spearman Correlation	0.765	12
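
For readers unfamiliar with the reported statistic, the following is a minimal sketch of how a Spearman rank correlation between metric scores and human ratings can be computed: rank both lists (averaging ranks over ties) and take the Pearson correlation of the ranks. The function names and example numbers are hypothetical and do not come from the paper; the source does not describe the exact evaluation protocol behind the values above.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, assigning tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical metric scores and human ratings for five videos.
    metric_scores = [0.12, 0.55, 0.31, 0.80, 0.47]
    human_ratings = [1.0, 4.0, 2.0, 5.0, 3.0]
    print(spearman(metric_scores, human_ratings))
```

Because only ranks matter, the statistic rewards a metric for ordering videos the way annotators do, regardless of the scale of its raw scores.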
