Long-CODE: Isolating Pure Long-Context as an Orthogonal Dimension in Video Evaluation

About

As video generation models grow more capable, robust video evaluation metrics become increasingly critical. Traditional metrics are tailored to short-video evaluation and predominantly assess frame-level visual quality and localized temporal smoothness. As state-of-the-art video generation models scale to longer videos, these metrics fail to capture essential long-range characteristics such as narrative richness and global causal consistency. Recognizing that short-term visual perception and long-context attributes are fundamentally orthogonal dimensions, we argue that long-video metrics should be disentangled from short-video assessments. In this paper, we focus on the rigorous justification and design of a dedicated framework for long-video evaluation. We first introduce a suite of long-video attribute corruption tests, which expose a critical limitation of existing short-video metrics: their insensitivity to structural inconsistencies such as shot-level perturbations and narrative shuffling. To bridge this gap, we design a novel long-video metric based on shot dynamics, which is highly sensitive to the corruptions in this long-range testing framework. Furthermore, we introduce Long-CODE (Long-Context as an Orthogonal Dimension for video Evaluation), a specialized dataset designed to benchmark long-video evaluation, with human annotations isolated specifically to genuine long-range characteristics. Extensive experiments show that our proposed metric achieves state-of-the-art correlation with human judgments. Ultimately, our metric and benchmark complement existing short-video standards, establishing a holistic and unbiased evaluation paradigm for video generation models.

Zhijiang Tang, Jiaxin Qi, Bing Zhao, Jianqiang Huang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Long Video Quality Evaluation | VGoT | Spearman correlation | 0.708 | 12 |
| Long Video Quality Evaluation | HoloCine | Spearman correlation | 0.571 | 12 |
| Long Video Quality Evaluation | Wan 2.2 | Spearman correlation | 0.835 | 12 |
| Long Video Quality Evaluation | StoryMem | Spearman correlation | 0.831 | 12 |
| Long Video Quality Evaluation | Veo 3.1 | Spearman correlation | 0.796 | 12 |
| Long Video Quality Evaluation | Sora 2 | Spearman correlation | 0.733 | 12 |
| Long Video Quality Evaluation | Overall Aggregated Models | Spearman correlation | 0.765 | 12 |
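
For readers unfamiliar with the reported statistic, the following is a minimal sketch of how a Spearman correlation between a metric's scores and human ratings can be computed: rank both lists, then take the Pearson correlation of the ranks. The function names and the sample numbers are hypothetical and are not taken from the paper or its dataset; the paper's exact evaluation protocol is not described on this page.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value ties with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical example: one metric score and one human rating per video.
metric_scores = [0.12, 0.45, 0.33, 0.80, 0.61]
human_ratings = [2, 3, 1, 5, 4]
print(round(spearman(metric_scores, human_ratings), 3))
```

Because only the rank order matters, the statistic is unchanged by any monotonic rescaling of the metric, which makes it a common choice for comparing metrics against human judgments.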