
E.M.Ground: A Temporal Grounding Vid-LLM with Holistic Event Perception and Matching

About

Despite recent advances in Video Large Language Models (Vid-LLMs), Temporal Video Grounding (TVG), which aims to precisely localize time segments corresponding to query events, remains a significant challenge. Existing methods often match start and end frames by comparing frame features with two separate tokens, relying heavily on exact timestamps. However, this approach fails to capture the event's semantic continuity and integrity, leading to ambiguities. To address this, we propose E.M.Ground, a novel Vid-LLM for TVG that focuses on holistic and coherent event perception. E.M.Ground introduces three key innovations: (i) a special <event> token that aggregates information from all frames of a query event, preserving semantic continuity for accurate event matching; (ii) Savitzky-Golay smoothing to reduce noise in token-to-frame similarities across timestamps, improving prediction accuracy; (iii) multi-grained frame feature aggregation to enhance matching reliability and temporal understanding, compensating for compression-induced information loss. Extensive experiments on benchmark datasets show that E.M.Ground consistently outperforms state-of-the-art Vid-LLMs by significant margins.

Jiahao Nie, Wenbin An, Gongjie Zhang, Yicheng Xu, Yap-Peng Tan, Alex C. Kot, Shijian Lu • 2026
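To make innovation (ii) concrete, the sketch below applies a Savitzky-Golay filter to a noisy token-to-frame similarity curve before decoding the event segment. This is a minimal illustration of the general technique, not the authors' implementation: the function name `localize_event`, the filter parameters (`window_length=9`, `polyorder=2`), and the half-of-peak decoding rule are all illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of Savitzky-Golay smoothing applied
# to per-frame similarity scores before event localization. All parameter
# values and the decoding rule below are illustrative assumptions.
import numpy as np
from scipy.signal import savgol_filter


def localize_event(similarities: np.ndarray,
                   window_length: int = 9,
                   polyorder: int = 2) -> tuple[int, int]:
    """Return (start_frame, end_frame) decoded from per-frame similarities."""
    # Savitzky-Golay fits a low-order polynomial in a sliding window,
    # suppressing frame-level noise while preserving the peak's shape.
    smoothed = savgol_filter(similarities, window_length, polyorder)

    # Hypothetical decoding rule: keep the contiguous run of frames around
    # the peak whose smoothed similarity exceeds half of the peak value.
    threshold = 0.5 * smoothed.max()
    above = smoothed >= threshold
    peak = int(smoothed.argmax())
    start = end = peak
    while start > 0 and above[start - 1]:
        start -= 1
    while end < len(smoothed) - 1 and above[end + 1]:
        end += 1
    return start, end


# Example: a noisy similarity curve with one underlying event near frame 45.
rng = np.random.default_rng(0)
sims = np.exp(-((np.arange(100) - 45) / 12.0) ** 2) \
       + 0.15 * rng.standard_normal(100)
print(localize_event(sims))  # e.g. (31, 59)
```

Smoothing first suppresses single-frame similarity spikes, so the contiguous above-threshold run around the peak better reflects the event's full temporal extent than picking raw per-frame maxima for the start and end tokens.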

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Highlight Detection | QVHighlights (test) | HIT@1: 52.5 | 151 |
| Temporal Video Grounding | Charades-STA (test) | Recall@IoU=0.5: 44.8 | 117 |
| Multi-modal Video Understanding | MVBench | -- | 39 |
| Dense Video Captioning | YouCook2 | SODA_c: 1.6 | 29 |
| Video Grounding | E.T. Bench-Grounding (test) | TVG F1: 52 | 19 |
| Dense Video Captioning | E.T.Bench | DVC F1: 41.6 | 14 |
