
TOPA: Extending Large Language Models for Video Understanding via Text-Only Pre-Alignment

About

Recent advancements in image understanding have benefited from the extensive use of web image-text pairs. However, video understanding remains a challenge despite the availability of substantial web video-text data. This difficulty primarily arises from the inherent complexity of videos and the inefficient language supervision in recent web-collected video-text datasets. In this paper, we introduce Text-Only Pre-Alignment (TOPA), a novel approach to extend large language models (LLMs) for video understanding, without the need for pre-training on real video data. Specifically, we first employ an advanced LLM to automatically generate Textual Videos comprising continuous textual frames, along with corresponding annotations to simulate real video-text data. Then, these annotated textual videos are used to pre-align a language-only LLM with the video modality. To bridge the gap between textual and real videos, we employ the CLIP model as the feature extractor to align image and text modalities. During text-only pre-alignment, the continuous textual frames, encoded as a sequence of CLIP text features, are analogous to continuous CLIP image features, thus aligning the LLM with real video representations. Extensive experiments, including zero-shot evaluation and fine-tuning on various video understanding tasks, demonstrate that TOPA is an effective and efficient framework for aligning video content with LLMs. In particular, without training on any video data, the TOPA-Llama2-13B model achieves a Top-1 accuracy of 51.0% on the challenging long-form video understanding benchmark, EgoSchema. This performance surpasses previous video-text pre-training approaches and proves competitive with recent GPT-3.5-based video agents.
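As a rough illustration of the pre-alignment idea (this is not the authors' code), the sketch below mimics the pipeline with a placeholder encoder: in the actual method each textual frame is encoded by CLIP's text encoder, and at inference real video frames are encoded by CLIP's image encoder. Because CLIP places both modalities in a shared embedding space, an LLM adapted on sequences of text features can later consume sequences of image features. The function names and the toy encoder here are hypothetical stand-ins.

```python
import numpy as np

EMBED_DIM = 512  # feature size of CLIP ViT-B/32

def encode_textual_frame(caption: str) -> np.ndarray:
    """Hypothetical stand-in for CLIP's text encoder: maps a caption to
    a deterministic (within one run) unit vector of the same size CLIP
    would produce."""
    rng = np.random.default_rng(abs(hash(caption)) % (2**32))
    v = rng.standard_normal(EMBED_DIM)
    return v / np.linalg.norm(v)

def textual_video_to_features(captions: list[str]) -> np.ndarray:
    """A 'textual video' is a sequence of continuous textual frames.
    Encoding each frame yields a (num_frames, dim) feature sequence,
    shaped like the CLIP image features of a real video."""
    return np.stack([encode_textual_frame(c) for c in captions])

# A toy textual video: continuous textual frames, as an LLM might generate.
textual_video = [
    "A person picks up a kettle from the stove.",
    "The person pours hot water into a mug.",
    "The person stirs the mug with a spoon.",
]
features = textual_video_to_features(textual_video)
print(features.shape)  # (3, 512), one CLIP-sized feature per frame
```

During pre-alignment the rows of `features` would come from CLIP's text encoder; at inference the same LLM receives rows produced by CLIP's image encoder from sampled video frames, which is what the shared CLIP space makes possible.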

Wei Li, Hehe Fan, Yongkang Wong, Mohan Kankanhalli, Yi Yang • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Video Understanding | MVBench | – | – | 247 |
| Video Question Answering | EgoSchema (Full) | Accuracy | 51 | 193 |
| Video Captioning | MSR-VTT (test) | CIDEr | 33.4 | 121 |
| Video Question Answering | EgoSchema subset | – | – | 73 |
| Video Captioning | VATEX (test) | CIDEr | 32 | 59 |
| Video Question Answering | EgoSchema Zero-shot | Accuracy | 51 | 11 |
| Video Question Answering | TVQA zero-shot (test) | Accuracy | 50.2 | 8 |
| Video Question Answering | NExT-QA zero-shot (test) | Temporal Score | 57.2 | 7 |
| Video Question Answering | STAR zero-shot (test) | Interaction Score | 41.6 | 7 |
