
TOPA: Extending Large Language Models for Video Understanding via Text-Only Pre-Alignment

About

Recent advancements in image understanding have benefited from the extensive use of web image-text pairs. However, video understanding remains a challenge despite the availability of substantial web video-text data. This difficulty primarily arises from the inherent complexity of videos and the inefficient language supervision in recent web-collected video-text datasets. In this paper, we introduce Text-Only Pre-Alignment (TOPA), a novel approach to extend large language models (LLMs) for video understanding, without the need for pre-training on real video data. Specifically, we first employ an advanced LLM to automatically generate Textual Videos comprising continuous textual frames, along with corresponding annotations to simulate real video-text data. Then, these annotated textual videos are used to pre-align a language-only LLM with the video modality. To bridge the gap between textual and real videos, we employ the CLIP model as the feature extractor to align image and text modalities. During text-only pre-alignment, the continuous textual frames, encoded as a sequence of CLIP text features, are analogous to continuous CLIP image features, thus aligning the LLM with real video representations. Extensive experiments, including zero-shot evaluation and fine-tuning on various video understanding tasks, demonstrate that TOPA is an effective and efficient framework for aligning video content with LLMs. In particular, without training on any video data, the TOPA-Llama2-13B model achieves a Top-1 accuracy of 51.0% on the challenging long-form video understanding benchmark, EgoSchema. This performance surpasses previous video-text pre-training approaches and proves competitive with recent GPT-3.5-based video agents.
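As a rough illustration of the pre-alignment idea (this is not the authors' code), the sketch below mimics the pipeline with a placeholder encoder: in the actual method each textual frame is encoded by CLIP's text encoder, and at inference real video frames are encoded by CLIP's image encoder. Because CLIP places both modalities in a shared embedding space, an LLM adapted on sequences of text features can later consume sequences of image features. The function names and the toy encoder here are hypothetical stand-ins.

```python
import numpy as np

EMBED_DIM = 512  # feature size of CLIP ViT-B/32

def encode_textual_frame(caption: str) -> np.ndarray:
    """Hypothetical stand-in for CLIP's text encoder: maps a caption to
    a deterministic (within one run) unit vector of the same size CLIP
    would produce."""
    rng = np.random.default_rng(abs(hash(caption)) % (2**32))
    v = rng.standard_normal(EMBED_DIM)
    return v / np.linalg.norm(v)

def textual_video_to_features(captions: list[str]) -> np.ndarray:
    """A 'textual video' is a sequence of continuous textual frames.
    Encoding each frame yields a (num_frames, dim) feature sequence,
    shaped like the CLIP image features of a real video."""
    return np.stack([encode_textual_frame(c) for c in captions])

# A toy textual video: continuous textual frames, as an LLM might generate.
textual_video = [
    "A person picks up a kettle from the stove.",
    "The person pours hot water into a mug.",
    "The person stirs the mug with a spoon.",
]
features = textual_video_to_features(textual_video)
print(features.shape)  # (3, 512), one CLIP-sized feature per frame
```

During pre-alignment the rows of `features` would come from CLIP's text encoder; at inference the same LLM receives rows produced by CLIP's image encoder from sampled video frames, which is what the shared CLIP space makes possible.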

Wei Li, Hehe Fan, Yongkang Wong, Mohan Kankanhalli, Yi Yang • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Video Understanding | MVBench | – | – | 247 |
| Video Question Answering | EgoSchema (Full) | Accuracy | 51 | 193 |
| Video Captioning | MSR-VTT (test) | CIDEr | 33.4 | 121 |
| Video Question Answering | EgoSchema subset | – | – | 73 |
| Video Captioning | VATEX (test) | CIDEr | 32 | 59 |
| Video Question Answering | EgoSchema Zero-shot | Accuracy | 51 | 11 |
| Video Question Answering | TVQA zero-shot (test) | Accuracy | 50.2 | 8 |
| Video Question Answering | NExT-QA zero-shot (test) | Temporal Score | 57.2 | 7 |
| Video Question Answering | STAR zero-shot (test) | Interaction Score | 41.6 | 7 |
