Query-Dependent Video Representation for Moment Retrieval and Highlight Detection

About

Video moment retrieval and highlight detection (MR/HD) have recently drawn growing attention as the demand for video understanding has increased sharply. The key objective of MR/HD is to localize the moment and to estimate the clip-wise accordance level, i.e., the saliency score, with respect to the given text query. Although recent transformer-based models have brought some advances, we found that these methods do not fully exploit the information in the given query. For example, the relevance between the text query and the video content is sometimes neglected when predicting the moment and its saliency. To tackle this issue, we introduce Query-Dependent DETR (QD-DETR), a detection transformer tailored for MR/HD. Because we observe that the given query plays an insignificant role in transformer architectures, our encoding module starts with cross-attention layers that explicitly inject the context of the text query into the video representation. Then, to enhance the model's capability of exploiting the query information, we manipulate the video-query pairs to produce irrelevant pairs. Such negative (irrelevant) video-query pairs are trained to yield low saliency scores, which in turn encourages the model to estimate the precise accordance between query-video pairs. Lastly, we present an input-adaptive saliency predictor that adaptively defines the criterion of saliency scores for the given video-query pairs. Our extensive studies verify the importance of building a query-dependent representation for MR/HD. Specifically, QD-DETR outperforms state-of-the-art methods on the QVHighlights, TVSum, and Charades-STA datasets. Code is available at github.com/wjun0830/QD-DETR.
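Two of the ideas in the abstract, query-to-video cross-attention and negative video-query pairs, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, learned projection matrices are omitted, negatives are formed by simply shifting queries within a batch, and the `-log(1 - s)` penalty is one assumed way of pushing saliency toward zero for irrelevant pairs.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(clips, words):
    """Scaled dot-product cross-attention (hypothetical helper).

    Each video clip feature acts as the attention query; the text token
    features act as keys and values. Learned projections are omitted
    (assumption), so the output is a weighted mix of token features.
    """
    dim = len(words[0])
    attended = []
    for clip in clips:
        scores = [
            sum(c * w for c, w in zip(clip, word)) / math.sqrt(dim)
            for word in words
        ]
        weights = softmax(scores)
        attended.append(
            [sum(wt * word[d] for wt, word in zip(weights, words)) for d in range(dim)]
        )
    return attended


def make_negative_pairs(videos, queries):
    """Pair each video with the query of the next sample in the batch.

    Assumes a batch of at least two samples with distinct queries;
    with a single sample the 'negative' would equal the positive pair.
    """
    shifted = queries[1:] + queries[:1]
    return list(zip(videos, shifted))


def negative_saliency_loss(scores, eps=1e-8):
    """Mean of -log(1 - s) for saliency scores s in (0, 1).

    The value is small when negative pairs receive low saliency and
    grows as their scores approach 1 (assumed loss form).
    """
    return sum(-math.log(1.0 - s + eps) for s in scores) / len(scores)
```

In a real model the attended clip features would feed the rest of the encoder, and the saliency scores passed to `negative_saliency_loss` would come from running the saliency head on the shifted pairs.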

WonJun Moon, Sangeek Hyun, SangUk Park, Dongchan Park, Jae-Pil Heo • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Moment Retrieval | Charades-STA (test) | R@0.5 | 57.31 | 172 |
| Moment Retrieval | QVHighlights (test) | R@1 (IoU=0.5) | 64.1 | 170 |
| Highlight Detection | QVHighlights (test) | HIT@1 | 64.2 | 151 |
| Temporal Video Grounding | Charades-STA (test) | Recall@IoU=0.5 | 57.31 | 117 |
| Video Moment Retrieval | Charades-STA (test) | Recall@1 (IoU=0.5) | 57.31 | 77 |
| Video Grounding | QVHighlights (test) | mAP (IoU=0.5) | 63.37 | 64 |
| Moment Retrieval | QVHighlights (val) | R@1 (IoU=0.5) | 62.9 | 53 |
| Video Moment Retrieval | Charades-STA | R1@0.5 | 57.31 | 44 |
| Highlight Detection | YouTube Highlights (test) | mAP (Dog) | 72.2 | 42 |
| Highlight Detection | QVHighlights (val) | HIT@1 | 63.03 | 35 |
Showing 10 of 32 rows
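
Several of the benchmark rows report recall at rank 1 with a temporal IoU threshold of 0.5. The following minimal sketch shows how such a metric is commonly computed. It assumes one ground-truth span per query and `(start, end)` spans in seconds; the function names are hypothetical, and the official evaluation scripts for each dataset may differ in detail.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds, gts, iou_threshold=0.5):
    """Percentage of queries whose top-1 predicted moment reaches the IoU threshold.

    Assumes a single ground-truth span per query (simplification).
    """
    hits = sum(
        temporal_iou(p, g) >= iou_threshold for p, g in zip(top1_preds, gts)
    )
    return 100.0 * hits / len(gts)


# Two queries: the first prediction overlaps its ground truth well,
# the second does not overlap at all.
preds = [(0.0, 10.0), (5.0, 8.0)]
truths = [(2.0, 12.0), (20.0, 30.0)]
score = recall_at_1(preds, truths)  # one of two queries is a hit
```

Here the first pair has an intersection of 8 seconds and a union of 12 seconds (IoU about 0.67), so it counts as a hit at the 0.5 threshold, while the second pair has no overlap.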
