HieraMamba: Video Temporal Grounding via Hierarchical Anchor-Mamba Pooling

About

Video temporal grounding, the task of localizing the start and end times of a natural language query in untrimmed video, requires capturing both global context and fine-grained temporal detail. This challenge is particularly pronounced in long videos, where existing methods often compromise temporal fidelity by over-downsampling or relying on fixed windows. We present HieraMamba, a hierarchical architecture that preserves temporal structure and semantic richness across scales. At its core are Anchor-MambaPooling (AMP) blocks, which utilize Mamba's selective scanning to produce compact anchor tokens that summarize video content at multiple granularities. Two complementary objectives, anchor-conditioned and segment-pooled contrastive losses, encourage anchors to retain local detail while remaining globally discriminative. HieraMamba sets a new state-of-the-art on Ego4D-NLQ, MAD, and TACoS, demonstrating precise, temporally faithful localization in long, untrimmed videos.
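The idea of summarizing a video with anchor tokens at multiple granularities can be illustrated with a toy pyramid. The sketch below is only an illustration under loud assumptions: it uses plain average pooling over consecutive token pairs as a stand-in for Mamba's selective scanning, and the names `pool_pairs` and `build_anchor_pyramid` are hypothetical, not taken from the paper or its code.

```python
from typing import List

Vector = List[float]


def pool_pairs(tokens: List[Vector]) -> List[Vector]:
    """Average each consecutive pair of tokens (stride 2).

    ASSUMPTION: mean pooling replaces the learned Anchor-MambaPooling
    step, which is not specified in the abstract.
    A trailing unpaired token is passed through unchanged.
    """
    pooled = []
    for i in range(0, len(tokens), 2):
        group = tokens[i:i + 2]
        dim = len(group[0])
        pooled.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return pooled


def build_anchor_pyramid(frames: List[Vector], num_levels: int) -> List[List[Vector]]:
    """Return [frames, level-1 anchors, level-2 anchors, ...].

    Each level halves the temporal length, so coarser levels summarize
    longer spans of the video.
    """
    levels = [frames]
    for _ in range(num_levels):
        if len(levels[-1]) <= 1:
            break  # nothing left to pool
        levels.append(pool_pairs(levels[-1]))
    return levels


# Toy input: 8 "frame features" of dimension 2.
frames = [[float(t), 1.0] for t in range(8)]
pyramid = build_anchor_pyramid(frames, num_levels=3)
level_lengths = [len(level) for level in pyramid]
print(level_lengths)
```

Fine levels keep short-range detail while coarse levels cover long spans, which is the trade-off the abstract describes; the actual model learns its anchors rather than averaging.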

Joungbin An, Kristen Grauman • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Temporal Video Grounding | ActivityNet-Captions (test) | -- | 32 |
| Video Grounding | Ego4D-NLQ v1 (test) | Recall@1 (Avg): 19.55 | 27 |
| Video Grounding | MAD 1.0 (test) | R@1 (IoU=0.3): 11.26 | 26 |
| Temporal Grounding | Ego4D-NLQ | R@1 (IoU=0.3): 18.81 | 25 |
| Natural Language Video Grounding | TACoS (val) | Recall@1 (IoU=0.3): 59.59 | 16 |
| Video Temporal Grounding | Ego4D NLQ v1 (val) | R@1 (IoU=0.3): 18.81 | 12 |
| Video Temporal Grounding | TACOS | Recall@1 (IoU=0.3): 59.59 | 12 |
| Long Video Temporal Grounding | Ego4D | Average Recall: 25.66 | 9 |
| Video Temporal Grounding | Ego4D NLQ (test) | Recall@5 (IoU=0.3): 40.82 | 8 |
| Video Temporal Grounding | Charades-STA (test) | Avg R@1: 62.35 | 6 |
Showing 10 of 12 rows
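
Most results above are reported as Recall@k at a temporal IoU threshold (for example R@1 at IoU=0.3). As a minimal sketch of how such a metric is commonly computed, assuming each query has one ground-truth segment and a ranked list of predicted `(start, end)` segments in seconds: the function names and toy data below are illustrative and are not taken from any benchmark's official evaluation script.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two 1-D time intervals."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(
    predictions: List[List[Segment]],
    ground_truths: List[Segment],
    k: int = 1,
    iou_threshold: float = 0.3,
) -> float:
    """Percentage of queries whose top-k predictions contain at least one
    segment with IoU >= iou_threshold against the ground truth."""
    hits = 0
    for ranked, gt in zip(predictions, ground_truths):
        if any(temporal_iou(p, gt) >= iou_threshold for p in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(ground_truths)


# Toy data (hypothetical): two queries, two ranked predictions each.
predictions = [
    [(10.0, 20.0), (0.0, 5.0)],    # top-1 overlaps the ground truth
    [(50.0, 60.0), (30.0, 42.0)],  # only the second prediction overlaps
]
ground_truths = [(12.0, 22.0), (30.0, 40.0)]

r1 = recall_at_k(predictions, ground_truths, k=1, iou_threshold=0.3)
r2 = recall_at_k(predictions, ground_truths, k=2, iou_threshold=0.3)
print(r1, r2)
```

Averaged variants such as "Recall@1 (Avg)" typically average this quantity over several IoU thresholds; the exact thresholds depend on the benchmark and are not stated on this page.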
