Chrono: A Simple Blueprint for Representing Time in MLLMs

About

The recent success of Large Language Models (LLMs) has prompted their extension to the multimodal domain, first with image-text Multimodal LLMs (MLLMs) and then with video-text models. In this work, we investigate the challenge of contextual and temporal comprehension in video-language models by exploring the task of temporal localization in videos. To address this problem, prior works have developed complex task-specific architectures, introduced novel modules to embed time into MLLMs, or leveraged additional input signals such as video transcripts to best encode contextual and temporal information. We find that most of these efforts are surpassed by a much simpler design. We introduce Chrono, a universal sequence blueprint that can be applied to any image-text pretrained MLLM. In extensive experiments spanning different MLLM architectures and sizes, as well as finetuning and zero-shot settings, we demonstrate new state-of-the-art results in moment retrieval on the widely used benchmarks Charades-STA, QVHighlights, and ActivityNet Captions, as well as in grounded video question answering on NExT-GQA.

Hector Rodriguez, Boris Meinardus, Anil Batra, Anna Rohrbach, Marcus Rohrbach • 2024

Related benchmarks

Task	Dataset	Metric	Result
Moment Retrieval	QVHighlights (test)	R@1 (IoU=0.5)	74.77	223
Temporal Video Grounding	Charades-STA (test)	Recall@IoU=0.5	69.3	124
Temporal Grounding	ActivityNet Captions	Recall@1 (IoU=0.5)	41.4	85
Video Moment Retrieval	Charades-STA	R1@0.5	69.31	57
Temporal Video Grounding	ActivityNet-Captions (test)	Recall@IoU>0.5	53.9	32
Video Temporal Grounding	QVHighlights	R1@0.5	81.8	23
Moment Retrieval	QVHighlights v1 (test)	R1@0.5	74.77	19
Video Moment Retrieval	ActivityNet-Captions (test)	R1@0.5	53.92	19
Video Grounding	ActivityNet-Captions (test)	R@1 (IoU=0.5)	53.9	15
Video Temporal Grounding	DiDeMo (test)	Recall@1 (IoU=0.3)	56	11

Showing 10 of 11 rows
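
The metrics in the table (R@1, R1@0.5, Recall@IoU=0.5) are all variants of the same idea: a predicted moment counts as correct when its temporal overlap with the ground-truth moment reaches an IoU threshold. The following is a minimal sketch of that computation, not the evaluation code used by the paper or by any of the benchmarks. The function names, the (start, end) span representation in seconds, and the use of `>=` at the threshold are assumptions made for illustration; individual benchmarks may use a strict `>` or differ in other details.

```python
from typing import List, Tuple

# Assumed representation: a moment is a (start_sec, end_sec) pair.
Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two time intervals."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(preds: List[Span], gts: List[Span],
                iou_threshold: float = 0.5) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold.

    Assumes one top-ranked prediction and one ground-truth span per query.
    """
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(temporal_iou(p, g) >= iou_threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(preds)


# Toy data, invented for illustration only (not taken from any benchmark).
example_preds = [(0.0, 10.0), (5.0, 15.0), (20.0, 30.0)]
example_gts = [(0.0, 8.0), (10.0, 20.0), (22.0, 30.0)]
example_recall = recall_at_1(example_preds, example_gts, iou_threshold=0.5)
```

In the toy data, the first and third predictions overlap their targets with IoU 0.8 and the second with IoU 1/3, so two of the three queries count as hits at the 0.5 threshold.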
