
Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding

About

Temporal grounding aims to localize a video moment that is semantically aligned with a given natural language query. Existing methods typically apply a detection or regression pipeline to the fused representation and focus on designing complicated prediction heads or fusion strategies. Instead, we treat temporal grounding as a metric-learning problem and present a Mutual Matching Network (MMN) that directly models the similarity between language queries and video moments in a joint embedding space. This metric-learning framework makes it possible to fully exploit negative samples in two new ways: constructing negative cross-modal pairs in a mutual matching scheme and mining negative pairs across different videos. These new negative samples can enhance the joint representation learning of the two modalities via cross-modal mutual matching, which maximizes their mutual information. Experiments show that MMN achieves highly competitive performance compared with state-of-the-art methods on four video grounding benchmarks. Based on MMN, we present a winning solution for the HC-STVG challenge of the 3rd PIC workshop. This suggests that metric learning is still a promising method for temporal grounding, since it captures the essential cross-modal correlation in a joint embedding space. Code is available at https://github.com/MCG-NJU/MMN.
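
To make the metric-learning idea concrete, here is a minimal sketch of a symmetric cross-modal contrastive (InfoNCE-style) loss over a batch of matched query and moment embeddings. It is an illustration under stated assumptions, not the MMN training objective: the function names, the cosine similarity, the temperature of 0.1, and the plain-list embeddings are all hypothetical choices made for this example. The authors' actual loss is in the linked repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def log_softmax_at(scores, index):
    """Log-probability of scores[index] under a softmax over scores."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[index] - log_z


def mutual_matching_loss(queries, moments, temperature=0.1):
    """Symmetric InfoNCE-style loss (illustrative, not the paper's exact loss).

    Assumes queries[i] is the positive match of moments[i]. Every other
    moment in the batch, including moments taken from other videos, is a
    negative for query i, and every other query is a negative for moment i.
    """
    n = len(queries)
    sim = [[cosine(q, m) / temperature for m in moments] for q in queries]
    # Query -> moment direction: each query should select its own moment.
    q2m = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Moment -> query direction: each moment should select its own query.
    m2q = -sum(
        log_softmax_at([sim[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (q2m + m2q)


# Toy 2-D embeddings: aligned pairs give a much lower loss than swapped pairs.
aligned = mutual_matching_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = mutual_matching_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Scoring in both directions is what makes the matching "mutual" in this sketch: a moment is penalized for resembling the wrong query just as a query is penalized for resembling the wrong moment. Minimizing an InfoNCE-style loss is a standard way to maximize a lower bound on the mutual information between the two modalities.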

Zhenzhi Wang, Limin Wang, Tao Wu, Tianhao Li, Gangshan Wu • 2021

Related benchmarks

Task | Dataset | Metric | Result | Rank
Moment Retrieval | Charades-STA (test) | R@0.5 | 46.93 | 172
Video Moment Retrieval | Charades-STA (test) | Recall@1 (IoU=0.5) | 47.31 | 77
Video Moment Retrieval | TACOS (test) | Recall@1 (0.5 Threshold) | 26.17 | 70
Temporal Grounding | Charades-STA (test) | Recall@1 (IoU=0.5) | 47.31 | 68
Temporal Grounding | ActivityNet Captions | Recall@1 (IoU=0.5) | 48.59 | 45
Video Grounding | TACOS | Recall@1 (IoU=0.5) | 26.17 | 45
Video Moment Retrieval | Charades-STA | R1@0.5 | 46.93 | 44
Video Grounding | ActivityNet Captions | R@1 (IoU=0.5) | 48.59 | 43
Spatio-Temporal Video Grounding | HCSTVG v2 (val) | m_vIoU | 30.3 | 38
Spatio-Temporal Video Grounding | HC-STVG (val) | Mean vIoU | 30.32 | 19
Showing 10 of 23 rows
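
Most rows in the table report Recall@1 at a temporal IoU threshold of 0.5. As a rough sketch of how such a number is usually computed, the code below scores each query's top-ranked moment against its ground-truth interval. The function names, the `(start, end)` tuple format in seconds, and the use of `>=` at the threshold are assumptions made for this example. Individual benchmarks may differ in these details.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals, each given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_predictions, ground_truths, iou_threshold=0.5):
    """Percentage of queries whose top-1 moment reaches the IoU threshold."""
    hits = sum(
        temporal_iou(p, g) >= iou_threshold
        for p, g in zip(top1_predictions, ground_truths)
    )
    return 100.0 * hits / len(ground_truths)


# Two toy queries: the first prediction overlaps well, the second misses.
predictions = [(0.0, 10.0), (5.0, 8.0)]
targets = [(2.0, 10.0), (20.0, 30.0)]
score = recall_at_1(predictions, targets)  # one hit out of two queries
```

Under this reading, a result of 47.31 would mean that for about 47% of test queries the top-ranked moment overlaps the annotated moment with IoU of at least 0.5.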
