
TALC: Time-Aligned Captions for Multi-Scene Text-to-Video Generation

About

Most text-to-video (T2V) generative models produce single-scene video clips that depict an entity performing a particular action (e.g., 'a red panda climbing a tree'). However, it is important to generate multi-scene videos, since they are ubiquitous in the real world (e.g., 'a red panda climbing a tree' followed by 'the red panda sleeps on the top of the tree'). To generate multi-scene videos from a pretrained T2V model, we introduce a simple and effective Time-Aligned Captions (TALC) framework. Specifically, we enhance the text-conditioning mechanism in the T2V architecture to recognize the temporal alignment between the video scenes and the scene descriptions. For instance, we condition the visual features of the earlier and later scenes of the generated video on the representations of the first scene description (e.g., 'a red panda climbing a tree') and the second scene description (e.g., 'the red panda sleeps on the top of the tree'), respectively. As a result, we show that the T2V model can generate multi-scene videos that adhere to the multi-scene text descriptions and are visually consistent (e.g., in entity and background). Further, we finetune the pretrained T2V model with multi-scene video-text data using the TALC framework. We show that the TALC-finetuned model outperforms the baseline, achieving a relative gain of 29% in the overall score, which averages visual consistency and text adherence as measured by human evaluation.
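
The time-aligned conditioning described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the frames are split evenly across scenes, and the names `frame_to_scene` and `time_aligned_conditioning` are hypothetical. Plain strings stand in for the caption representations that a real T2V model would obtain from its text encoder.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def frame_to_scene(num_frames: int, num_scenes: int) -> List[int]:
    """Map each frame index to a scene index.

    Assumption (not from the source): frames are divided as evenly
    as possible among the scenes, in temporal order.
    """
    if num_scenes < 1 or num_frames < num_scenes:
        raise ValueError("need at least one frame per scene")
    return [f * num_scenes // num_frames for f in range(num_frames)]


def time_aligned_conditioning(num_frames: int, captions: Sequence[T]) -> List[T]:
    """Return, for every frame, the caption representation it is conditioned on.

    Earlier frames receive the first scene description and later frames
    the following ones, instead of all frames sharing a single prompt.
    """
    mapping = frame_to_scene(num_frames, len(captions))
    return [captions[scene] for scene in mapping]


# Placeholder strings stand in for text-encoder outputs (hypothetical).
captions = ["a red panda climbing a tree",
            "the red panda sleeps on the top of the tree"]
per_frame = time_aligned_conditioning(8, captions)
for index, caption in enumerate(per_frame):
    print(index, caption)
```

In this sketch, frames 0 to 3 are paired with the first scene description and frames 4 to 7 with the second; a real model would feed each frame's paired representation into its text-conditioning layers.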

Hritik Bansal, Yonatan Bitton, Michal Yarom, Idan Szpektor, Aditya Grover, Kai-Wei Chang • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Text-to-Video Generation | VBench | Quality Score: 62.5 | 111 |
| Video Generation | User Study (test) | Video Quality Score: 12.31 | 8 |
| Multi-scene video generation | Multi-scene evaluation dataset 1.0 (test) | Visual Consistency: 67.47 | 5 |
| Auto-regressive scene extension | T2V-CompBench | Action Binding Score: 2.15 | 5 |
| Auto-regressive scene extension | EvalCrafter | VQA_A: 3.72 | 5 |
