CAST: Modeling Visual State Transitions for Consistent Video Retrieval

About

As video content creation shifts toward long-form narratives, composing short clips into coherent storylines becomes increasingly important. However, prevailing retrieval formulations remain context-agnostic at inference time, prioritizing local semantic alignment while neglecting state and identity consistency. To address this structural limitation, we formalize the task of Consistent Video Retrieval (CVR) and introduce a diagnostic benchmark spanning YouCook2, COIN, and CrossTask. We propose CAST (Context-Aware State Transition), a lightweight, plug-and-play adapter compatible with diverse frozen vision-language embedding spaces. By predicting a state-conditioned residual update ($\Delta$) from visual history, CAST introduces an explicit inductive bias for latent state evolution. Extensive experiments show that CAST improves performance on YouCook2 and CrossTask, remains competitive on COIN, and consistently outperforms zero-shot baselines across diverse foundation backbones. Furthermore, CAST provides a useful reranking signal for black-box video generation candidates (e.g., from Veo), promoting more temporally coherent continuations.
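The residual-update idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The mean pooling of the history, the single `tanh` layer, the use of a query embedding, and the names `predict_next_state`, `rank_candidates`, `W_h`, and `W_q` are all assumptions made for illustration. The abstract states only that a residual $\Delta$ is predicted from visual history on top of frozen embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products behave as cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def predict_next_state(history, query, W_h, W_q):
    """Predict the embedding of the next clip as last state + residual delta.

    history: (T, d) frozen embeddings of the clips seen so far
    query:   (d,)   frozen embedding of the step description (assumed input)
    W_h, W_q: (d, d) hypothetical adapter weights; the backbone stays frozen
    """
    context = history.mean(axis=0)                 # assumed pooling of visual history
    delta = np.tanh(context @ W_h + query @ W_q)   # state-conditioned residual update
    return l2_normalize(history[-1] + delta)

def rank_candidates(history, query, candidates, W_h, W_q):
    """Order candidate clips by cosine similarity to the predicted next state."""
    target = predict_next_state(history, query, W_h, W_q)
    scores = l2_normalize(candidates) @ target
    order = np.argsort(-scores)                    # best candidate first
    return order, scores
```

The same scoring could in principle be applied to embeddings of generated clips to rerank them, which is the role the abstract describes for black-box generation candidates. With all-zero adapter weights the residual vanishes and the ranking reduces to similarity with the most recent clip.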

Yanqing Liu, Yingcheng Liu, Fanghong Dong, Budianto Budianto, Cihang Xie, Yan Jiao • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Consistent Video Retrieval | COIN (test) | Accuracy | 51.64 | 13 |
| Consistent Video Retrieval | CrossTask (test) | Accuracy | 0.6436 | 13 |
| Consistency Evaluation | Diagnostic (Avg. YouCook2, COIN, CrossTask) (test) | State Accuracy | 76.92 | 8 |
| Consistent Video Retrieval | YouCook2 (test) | Accuracy | 75.59 | 8 |
| Consistent Video Retrieval | YouCook2 official (val) | Accuracy | 44.77 | 5 |
| Consistent Video Retrieval | Diagnostic Average of YouCook2, COIN, CrossTask | State Accuracy | 53.81 | 5 |
| Video Generation | YouCook2 (val) | Overall Preference Score | 55.1 | 3 |
