
Learning Video Representations from Large Language Models

About

We introduce LaViLa, a new approach to learning video-language representations by leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be conditioned on visual input, and finetune them to create automatic video narrators. Our auto-generated narrations offer a number of advantages, including dense coverage of long videos, better temporal synchronization of the visual information and text, and much higher diversity of text. The video-text embedding learned contrastively with these additional auto-generated narrations outperforms the previous state-of-the-art on multiple first-person and third-person video tasks, in both zero-shot and finetuned setups. Most notably, LaViLa obtains an absolute gain of 10.1% on the EGTEA classification benchmark and 5.9% on the Epic-Kitchens-100 multi-instance retrieval benchmark. Furthermore, LaViLa trained with only half the narrations from the Ego4D dataset outperforms baseline models trained on the full set, and shows positive scaling behavior with increasing pre-training data and model size.
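The core of the learned video-text embedding is a standard contrastive (InfoNCE-style) objective over paired video and narration embeddings. The sketch below is an illustrative NumPy version of that symmetric batch-contrastive loss, not the authors' implementation; the function name, batch layout, and the temperature value are assumptions.

```python
import numpy as np

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric batch-contrastive loss over paired video/text embeddings.

    Rows of `video_emb` and `text_emb` are assumed to be matching pairs,
    so the correct logits lie on the diagonal of the similarity matrix.
    Illustrative sketch; the temperature value is an assumption.
    """
    # L2-normalize so dot products are cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # (B, B) similarity matrix

    def cross_entropy_diag(lg):
        # Cross-entropy where the target for row i is column i.
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.diag(log_probs).mean()

    # Average the video->text and text->video directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

In practice the same loss is computed on both human-written and auto-generated narrations, which is what lets the denser generated text improve the embedding.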

Yue Zhao, Ishan Misra, Philipp Krähenbühl, Rohit Girdhar · 2022

Related benchmarks

Task                     | Dataset               | Result              | Rank
Video Question Answering | MSRVTT-QA (test)      | Accuracy 44.9       | 371
Text-to-Video Retrieval  | DiDeMo                | R@1 0.566           | 360
Video Question Answering | MSVD-QA (test)        | Accuracy 53.7       | 274
Text-to-Video Retrieval  | ActivityNet           | R@1 0.587           | 197
Video-to-Text Retrieval  | DiDeMo                | R@1 47.1            | 108
Video-to-Text Retrieval  | ActivityNet           | R@1 0.503           | 99
Text-to-Video Retrieval  | MSRVTT                | R@1 56              | 98
Action Recognition       | EPIC-KITCHENS (val)   | Verb Top-1 Acc 72   | 36
Action Recognition       | Epic Kitchens 100     | --                  | 26
Video-to-Text Retrieval  | MSRVTT                | R@1 49              | 24
(Showing 10 of 23 rows.)
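The R@1 and R@10 numbers in the table are recall-at-k: the fraction of queries whose correct match appears among the top-k retrieved items. A minimal NumPy sketch of that metric, assuming a square similarity matrix with matching pairs on the diagonal (the function name is illustrative, not benchmark code):

```python
import numpy as np

def recall_at_k(similarity, k=1):
    """Fraction of queries whose true match ranks in the top-k.

    `similarity[i, j]` scores query i against gallery item j, and the
    correct match for query i is assumed to sit at column i.
    """
    ranks = np.argsort(-similarity, axis=1)  # best match first per row
    topk = ranks[:, :k]                      # top-k gallery indices per query
    hits = (topk == np.arange(len(similarity))[:, None]).any(axis=1)
    return hits.mean()
```

For example, an identity similarity matrix scores R@1 = 1.0, while a matrix whose best match is always wrong scores R@1 = 0.0.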
