Knowledge-Refined Dual Context-Aware Network for Partially Relevant Video Retrieval
About
Retrieving partially relevant segments from untrimmed videos remains difficult due to two persistent challenges: the mismatch in information density between text and video segments, and limited attention mechanisms that overlook semantic focus and event correlations. We present KDC-Net, a Knowledge-Refined Dual Context-Aware Network that tackles these issues from both textual and visual perspectives. On the text side, a Hierarchical Semantic Aggregation module captures and adaptively fuses multi-scale phrase cues to enrich query semantics. On the video side, a Dynamic Temporal Attention mechanism employs relative positional encoding and adaptive temporal windows to highlight key events with local temporal coherence. Additionally, a dynamic CLIP-based distillation strategy, enhanced with temporal-continuity-aware refinement, ensures segment-aware and objective-aligned knowledge transfer. Experiments on PRVR benchmarks show that KDC-Net consistently outperforms state-of-the-art methods, especially under low moment-to-video ratios.
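To make the idea of attention restricted to a local temporal window with relative positional encoding concrete, here is a minimal sketch. It is not the KDC-Net implementation: the abstract does not specify how the window is adapted or how relative positions are encoded. The function name `windowed_attention`, the fixed half-width `window`, and the additive offset table `rel_bias` are all illustrative assumptions.

```python
import math

def windowed_attention(scores, window, rel_bias):
    """Softmax attention weights restricted to a local temporal window.

    scores[i][j] : raw similarity between clip i and clip j.
    window       : half-width of the local window. Fixed here; a hypothetical
                   stand-in for the adaptive window the abstract mentions.
    rel_bias     : dict mapping relative offset (j - i) to an additive bias,
                   an assumed, simple form of relative positional encoding.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        row = []
        for j in range(n):
            if abs(j - i) <= window:
                # Inside the window: add the bias for this relative offset.
                row.append(scores[i][j] + rel_bias.get(j - i, 0.0))
            else:
                # Outside the window: masked out, gets zero weight after softmax.
                row.append(float("-inf"))
        m = max(row)  # finite, because position i is always inside its own window
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

With uniform scores and no bias, each clip spreads its attention evenly over the clips inside its window and assigns zero weight to everything else.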
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Partially Relevant Video Retrieval | TVR | R@1 | 15.4 | 16 |
| Partially Relevant Video Retrieval | ActivityNet Captions | R@1 | 8.1 | 16 |
| Partially Relevant Video Retrieval | TVR M/V Interval (0, 0.2] | SumR | 184.4 | 12 |
| Partially Relevant Video Retrieval | TVR M/V Interval (0.2, 0.4] | SumR | 178.5 | 12 |
| Partially Relevant Video Retrieval | TVR M/V Interval (0.4, 1] | SumR | 183.9 | 12 |
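The metrics in the table can be sketched as follows. This page does not define them, so the definitions here are assumptions based on common PRVR practice: R@K is the percentage of queries whose ground-truth video is ranked in the top K, SumR is the sum of R@1, R@5, R@10, and R@100, and the M/V (moment-to-video) ratio is the moment's duration divided by the video's duration. The function names are illustrative.

```python
def recall_at_k(ranks, k):
    """Percentage of queries whose ground-truth video ranks in the top k.

    ranks: 1-based rank of the ground-truth video for each query.
    """
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)

def sum_r(ranks, ks=(1, 5, 10, 100)):
    """Sum of R@K over the given cutoffs (assumed: 1, 5, 10, 100)."""
    return sum(recall_at_k(ranks, k) for k in ks)

def mv_ratio(moment_start, moment_end, video_duration):
    """Moment-to-video ratio: fraction of the video covered by the relevant moment."""
    return (moment_end - moment_start) / video_duration

# Toy example with five queries (made-up ranks, not results from the paper).
ranks = [1, 3, 7, 50, 200]
print(recall_at_k(ranks, 1))   # 20.0
print(sum_r(ranks))            # 200.0 = 20 + 40 + 60 + 80
print(mv_ratio(10.0, 25.0, 100.0))  # 0.15, which falls in the (0, 0.2] interval
```

A low M/V ratio means the query describes only a short part of a long video, which is the setting where the abstract reports the largest gains.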