HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics

About

Long-form video understanding presents unique challenges that extend beyond traditional short-video analysis approaches, particularly in capturing long-range dependencies, processing redundant information efficiently, and extracting high-level semantic concepts. To address these challenges, we propose a novel approach that more accurately reflects human cognition. This paper introduces HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics, featuring two versatile modules that can enhance existing video-language models or operate as a standalone system. Our Episodic COmpressor (ECO) efficiently aggregates representations from micro to semi-macro levels, reducing computational overhead while preserving temporal dependencies. Our Semantics ReTRiever (SeTR) enriches these representations with semantic information by focusing on broader context, dramatically reducing feature dimensionality while preserving relevant macro-level information. We demonstrate that these modules can be seamlessly integrated into existing SOTA models, consistently improving their performance while reducing inference latency by up to 43% and memory usage by 46%. As a standalone system, HERMES achieves state-of-the-art performance across multiple long-video understanding benchmarks in both zero-shot and fully-supervised settings.

Gueter Josmy Faure, Jia-Fong Yeh, Min-Hung Chen, Hung-Ting Su, Shang-Hong Lai, Winston H. Hsu • 2024
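
The abstract describes ECO as aggregating frame-level representations into a smaller set while preserving temporal order, but it does not give the procedure. As a rough illustration only, the sketch below shows one common way to do this kind of compression: repeatedly merge the most similar pair of adjacent feature vectors until a fixed budget is met. The function name `compress_episodes`, the `budget` parameter, the cosine-similarity criterion, and the count-weighted averaging are all assumptions made for this example; they are not taken from the paper or its code.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def compress_episodes(frames: List[Vector], budget: int) -> List[Vector]:
    """Hypothetical episodic compression (an assumption, not the paper's ECO).

    Greedily merges the most similar pair of temporally adjacent feature
    vectors until at most `budget` vectors remain. Temporal order is kept
    because only neighbours are ever merged.
    """
    if budget < 1:
        raise ValueError("budget must be at least 1")

    episodes = [[float(x) for x in f] for f in frames]
    counts = [1] * len(episodes)  # how many frames each episode represents

    while len(episodes) > budget:
        # Similarity between each episode and its right-hand neighbour.
        sims = [cosine(episodes[i], episodes[i + 1])
                for i in range(len(episodes) - 1)]
        i = max(range(len(sims)), key=sims.__getitem__)

        # Count-weighted mean so long episodes are not diluted by one frame.
        total = counts[i] + counts[i + 1]
        merged = [(x * counts[i] + y * counts[i + 1]) / total
                  for x, y in zip(episodes[i], episodes[i + 1])]

        episodes[i:i + 2] = [merged]
        counts[i:i + 2] = [total]

    return episodes
```

For example, four toy frame features that form two visually distinct segments collapse to one vector per segment when `budget=2`. Merging only adjacent vectors is a design choice that trades some compression flexibility for guaranteed temporal coherence.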

Related benchmarks

Task                            Dataset                       Metric                Result  Rank
Video Action Classification     COIN                          Top-1 Acc             93.5    33
Video Question Answering        MovieChat-1k Breakpoint       Accuracy              65.8    23
Long-form Video Classification  Breakfast                     Top-1 Accuracy        95.2    14
Video Question Answering        MovieChat Global Breakpoint   Breakpoint Accuracy   57.3    14
Long-form Video Classification  LVU                           Relation Accuracy     67.6    10
Video Question Answering        MovieChat 1K (test)           Accuracy              84.9    7
Video Question Answering        MovieChat-1k Global           Accuracy              84.9    6