HieraMamba: Video Temporal Grounding via Hierarchical Anchor-Mamba Pooling
About
Video temporal grounding, the task of localizing the start and end times of a natural language query in untrimmed video, requires capturing both global context and fine-grained temporal detail. This challenge is particularly pronounced in long videos, where existing methods often compromise temporal fidelity by over-downsampling or relying on fixed windows. We present HieraMamba, a hierarchical architecture that preserves temporal structure and semantic richness across scales. At its core are Anchor-MambaPooling (AMP) blocks, which utilize Mamba's selective scanning to produce compact anchor tokens that summarize video content at multiple granularities. Two complementary objectives, anchor-conditioned and segment-pooled contrastive losses, encourage anchors to retain local detail while remaining globally discriminative. HieraMamba sets a new state-of-the-art on Ego4D-NLQ, MAD, and TACoS, demonstrating precise, temporally faithful localization in long, untrimmed videos.
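The hierarchical pooling idea can be illustrated with a toy sketch. This is not the paper's implementation: the gated cumulative scan below is only a simplified stand-in for Mamba's selective scan, and the names `selective_scan_pool` and `hierarchy`, along with the gating and decay scheme, are hypothetical. It shows only the structural point that each AMP-style level compresses the sequence into fewer anchor tokens, yielding summaries at multiple granularities.

```python
import math

def selective_scan_pool(features, stride=2, decay=0.5):
    """Gated running summary of a feature sequence; emits one anchor
    token per `stride` inputs (a crude stand-in for selective scanning)."""
    anchors, state = [], [0.0] * len(features[0])
    for t, x in enumerate(features, start=1):
        # Input-dependent gate: frames with larger mean activation
        # overwrite more of the running state.
        gate = 1.0 / (1.0 + math.exp(-sum(x) / len(x)))
        state = [decay * (1 - gate) * s + gate * xi for s, xi in zip(state, x)]
        if t % stride == 0:
            anchors.append(list(state))  # emit a compact anchor token
    return anchors

def hierarchy(features, levels=2, stride=2):
    """Stack pooling blocks: each level halves the sequence length,
    producing a coarse-to-fine pyramid of anchor tokens."""
    pyramid, cur = [], features
    for _ in range(levels):
        cur = selective_scan_pool(cur, stride)
        pyramid.append(cur)
    return pyramid

frames = [[float(t), t / 10.0] for t in range(8)]  # 8 toy "frames", dim 2
pyr = hierarchy(frames, levels=2)
print(len(pyr[0]), len(pyr[1]))  # 4 anchors at level 1, 2 at level 2
```

With 8 input frames and a stride of 2, the first level yields 4 anchors and the second yields 2, so each level trades temporal resolution for a broader summary, mirroring the multi-granularity design described above.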
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Temporal Video Grounding | ActivityNet-Captions (test) | -- | 32 |
| Video Grounding | Ego4D-NLQ v1 (test) | Recall@1 (Avg): 19.55 | 27 |
| Video Grounding | MAD 1.0 (test) | R@1 (IoU=0.3): 11.26 | 26 |
| Temporal Grounding | Ego4D-NLQ | R@1 (IoU=0.3): 18.81 | 25 |
| Natural Language Video Grounding | TACoS (val) | Recall@1 (IoU=0.3): 59.59 | 16 |
| Video Temporal Grounding | Ego4D NLQ v1 (val) | R@1 (IoU=0.3): 18.81 | 12 |
| Video Temporal Grounding | TACoS | Recall@1 (IoU=0.3): 59.59 | 12 |
| Long Video Temporal Grounding | Ego4D | Average Recall: 25.66 | 9 |
| Video Temporal Grounding | Ego4D NLQ (test) | Recall@5 (IoU=0.3): 40.82 | 8 |
| Video Temporal Grounding | Charades-STA (test) | Avg R@1: 62.35 | 6 |