
$R^2$-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding

About

Video temporal grounding (VTG) is a fine-grained video understanding problem that aims to ground relevant clips in untrimmed videos given natural language queries. Most existing VTG models are built upon frame-wise final-layer CLIP features, aided by additional temporal backbones (e.g., SlowFast) with sophisticated temporal reasoning mechanisms. In this work, we claim that CLIP itself already shows great potential for fine-grained spatial-temporal modeling, as each layer offers distinct yet useful information under different granularity levels. Motivated by this, we propose Reversed Recurrent Tuning ($R^2$-Tuning), a parameter- and memory-efficient transfer learning framework for video temporal grounding. Our method learns a lightweight $R^2$ Block containing only 1.5% of the total parameters to perform progressive spatial-temporal modeling. Starting from the last layer of CLIP, $R^2$ Block recurrently aggregates spatial features from earlier layers, then refines temporal correlation conditioning on the given query, resulting in a coarse-to-fine scheme. $R^2$-Tuning achieves state-of-the-art performance across three VTG tasks (i.e., moment retrieval, highlight detection, and video summarization) on six public benchmarks (i.e., QVHighlights, Charades-STA, Ego4D-NLQ, TACoS, YouTube Highlights, and TVSum) even without the additional backbone, demonstrating the significance and effectiveness of the proposed scheme. Our code is available at https://github.com/yeliudev/R2-Tuning.

Ye Liu, Jixuan He, Wanhua Li, Junsik Kim, Donglai Wei, Hanspeter Pfister, Chang Wen Chen • 2024
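The coarse-to-fine loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see the linked repository for that); it only mirrors the control flow: start at the last layer, step back through earlier layers, pool spatial features, accumulate them into a recurrent state, and refine that state over time using the query. The function names, the query-weighted pooling rule, the additive update, and the 0.5 blending factor are all assumptions made for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pool_patches(patches, query):
    """Query-weighted average of one frame's patch vectors (assumed pooling rule)."""
    weights = softmax([dot(p, query) for p in patches])
    return [sum(w * p[d] for w, p in zip(weights, patches)) for d in range(len(query))]


def refine_temporal(frames, query):
    """Blend each frame with a query-weighted summary of all frames (assumed rule)."""
    weights = softmax([dot(f, query) for f in frames])
    dim = len(query)
    context = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    return [[0.5 * f[d] + 0.5 * context[d] for d in range(dim)] for f in frames]


def reversed_recurrent_pass(layer_feats, query, num_steps):
    """layer_feats[k][t][p] is the patch vector of layer k, frame t, patch p."""
    if not 1 <= num_steps <= len(layer_feats):
        raise ValueError("num_steps must be between 1 and the number of layers")
    hidden, visited = None, []
    last = len(layer_feats) - 1
    for k in range(last, last - num_steps, -1):  # last layer first, then earlier ones
        pooled = [pool_patches(frame, query) for frame in layer_feats[k]]
        if hidden is None:
            hidden = pooled
        else:  # hypothetical additive recurrent update
            hidden = [[h + x for h, x in zip(hf, pf)] for hf, pf in zip(hidden, pooled)]
        hidden = refine_temporal(hidden, query)
        visited.append(k)
    return hidden, visited


# Toy input: 4 layers, 3 frames, 2 patches per frame, 2-dimensional features.
toy_feats = [[[[float(k + t + p), 1.0] for p in range(2)] for t in range(3)] for k in range(4)]
state, order = reversed_recurrent_pass(toy_feats, query=[0.1, 0.2], num_steps=2)
print(order)  # layers visited, from the last one backwards
```

In this sketch the same two functions are reused at every step, which is the sense in which the block is "recurrent"; a learned model would replace the fixed pooling and blending rules with trainable parameters.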

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Moment Retrieval | Charades-STA (test) | R@0.5 | 59.78 | 172 |
| Moment Retrieval | QVHighlights (test) | R@1 (IoU=0.5) | 68.03 | 170 |
| Highlight Detection | QVHighlights (test) | HIT@1 | 64.2 | 151 |
| Video Grounding | Charades-STA | R@1 IoU=0.5 | 59.8 | 113 |
| Video Moment Retrieval | Charades-STA (test) | Recall@1 (IoU=0.5) | 59.78 | 77 |
| Video Moment Retrieval | TACOS (test) | Recall@1 (0.5 Threshold) | 38.72 | 70 |
| Video Grounding | QVHighlights (test) | mAP (IoU=0.5) | 69.04 | 64 |
| Moment Retrieval | QVHighlights (val) | - | - | 53 |
| Video Moment Retrieval | Charades-STA | R1@0.5 | 59.8 | 44 |
| Highlight Detection | QVHighlights (val) | HIT@1 | 64.13 | 35 |
Showing 10 of 17 rows
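
Several of the benchmark rows report recall at rank 1 with a temporal IoU threshold of 0.5. As a rough guide to how such a number is typically computed, here is a minimal sketch. It assumes each query has one ground-truth span and one top-ranked predicted span, both given as (start, end) in seconds, and that the threshold comparison is inclusive. The function names are illustrative, and the exact evaluation protocol behind each leaderboard row is not stated on this page.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) time spans."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds, gts, iou_threshold=0.5):
    """Percentage of queries whose top-1 span reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= iou_threshold for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)


# Toy example with three queries (made-up spans, not benchmark data).
preds = [(0, 10), (5, 8), (20, 30)]
truth = [(2, 10), (0, 4), (22, 34)]
print(round(recall_at_1(preds, truth), 2))  # two of three spans pass the threshold
```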
