Video Understanding: Through A Temporal Lens

About

This thesis explores the central question of how to leverage temporal relations among video elements to advance video understanding. Addressing the limitations of existing methods, the work presents a five-fold contribution: (1) an automatic annotation framework that utilizes large vision-language models and a noise-robust contrastive learning objective with a subtractive angular margin; (2) a parameter-efficient fine-tuning strategy using "recurrent adapters" to capture temporal dynamics in low-data regimes; (3) the integration of State Space Layers (SSL) for efficient long-form video modeling, supported by the introduction of two new long-term benchmarks for egocentric and feature-length content; (4) a novel contrastive learning framework designed to explicitly model fine-grained relations between motions and video moments; and (5) a comprehensive empirical study on Large Vision-Language Models (LVLMs) that identifies the visual-language interface as a bottleneck for temporal reasoning, leading to a new "temporal-oriented recipe" for upscaled video understanding. Collectively, these contributions demonstrate that explicit temporal modeling significantly enhances a model's ability to represent and reason about the fluid nature of video content.

Thong Thanh Nguyen • 2026

Related benchmarks

Task	Dataset	Result
Text-to-Video Retrieval	DiDeMo	R@1 0.627	465
Video Question Answering	ActivityNet-QA	Accuracy 58.3	418
Video Question Answering	MSRVTT-QA (test)	Accuracy 46.4	376
Video Question Answering	MSVD-QA (test)	Accuracy 55.8	279
Text-to-Video Retrieval	ActivityNet	R@1 0.6	245
Video Captioning	MSVD	CIDEr 59.4	157
Video Question Answering	MSVD	Accuracy 79.5	152
Text-to-Video Retrieval	MSRVTT	R@1 60	144
Video-to-Text Retrieval	ActivityNet	R@1 0.523	136
Video-to-Text Retrieval	DiDeMo	R@1 48.1	136

Showing 10 of 33 rows
