Video-EM: Event-Centric Episodic Memory for Long-Form Video Understanding
About
Video Large Language Models (Video-LLMs) have shown strong video understanding, yet their application to long-form videos remains constrained by limited context windows. A common workaround is to compress long videos into a handful of representative frames via retrieval or summarization. However, most existing pipelines score frames in isolation, implicitly assuming that frame-level saliency is sufficient for downstream reasoning. This often yields redundant selections, fragmented temporal evidence, and weakened narrative grounding for long-form video question answering. We present Video-EM, a training-free, event-centric episodic memory framework that reframes long-form VideoQA as episodic event construction followed by memory refinement. Instead of treating retrieved keyframes as independent visuals, Video-EM employs an LLM as an active memory agent to orchestrate off-the-shelf tools: it first localizes query-relevant moments via multi-grained semantic matching, then groups and segments them into temporally coherent events, and finally encodes each event as a grounded episodic memory with explicit temporal indices and spatio-temporal cues (capturing when, where, what, and involved entities). To further suppress verbosity and noise from imperfect upstream signals, Video-EM integrates a reasoning-driven self-reflection loop that iteratively verifies evidence sufficiency and cross-event consistency, removes redundancy, and adaptively adjusts event granularity. The outcome is a compact yet reliable event timeline: a minimal but sufficient episodic memory set that can be directly consumed by existing Video-LLMs without additional training or architectural changes.
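As a rough illustration of the "group into temporally coherent events" step described above, the following minimal sketch merges retrieved keyframe timestamps into events whenever consecutive frames fall within a fixed time gap. This is not the authors' implementation: the `Event` class, the `group_into_events` function, and the `max_gap` threshold are all hypothetical names and assumptions chosen for illustration, and the abstract does not specify how Video-EM actually segments events.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    """Hypothetical container for one temporally coherent event."""
    start: float                      # timestamp (seconds) of the first frame
    end: float                        # timestamp (seconds) of the last frame
    frames: List[float] = field(default_factory=list)


def group_into_events(timestamps: List[float], max_gap: float = 2.0) -> List[Event]:
    """Merge keyframe timestamps into events.

    Assumption: two consecutive keyframes belong to the same event when they
    are at most `max_gap` seconds apart. This simple gap rule stands in for
    whatever grouping criterion the actual system uses.
    """
    events: List[Event] = []
    for t in sorted(timestamps):
        if events and t - events[-1].end <= max_gap:
            # Close enough to the previous frame: extend the current event.
            events[-1].end = t
            events[-1].frames.append(t)
        else:
            # Gap too large (or first frame): start a new event.
            events.append(Event(start=t, end=t, frames=[t]))
    return events


if __name__ == "__main__":
    retrieved = [10.0, 1.0, 2.0, 11.5, 30.0]
    for ev in group_into_events(retrieved, max_gap=2.0):
        print(f"event {ev.start:.1f}s-{ev.end:.1f}s with {len(ev.frames)} frame(s)")
```

Each resulting event carries explicit start and end times, which is the kind of temporal index the abstract says is attached to every episodic memory. A real system would presumably also use semantic similarity, not only time gaps, when deciding event boundaries.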
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Understanding | Video-MME | Overall Score | 66.2 | 96 |
| Video Question Answering | EgoSchema 3 min (test) | Accuracy | 65.6 | 18 |
| Long Video Understanding | LVBench 1.0 (test) | Overall Score | 45.7 | 13 |
| Long Video Understanding | HourVideo 1.0 (test) | Overall Score | 35.1 | 12 |
| Long Video Understanding | HourVideo | Overall Score | 35.1 | 12 |