Delving Deeper: Hierarchical Visual Perception for Robust Video-Text Retrieval

About

Video-text retrieval (VTR) aims to locate relevant videos using natural language queries. Current methods, often based on pre-trained models like CLIP, are hindered by video's inherent redundancy and their reliance on coarse, final-layer features, limiting matching accuracy. To address this, we introduce the HVP-Net (Hierarchical Visual Perception Network), a framework that mines richer video semantics by extracting and refining features from multiple intermediate layers of a vision encoder. Our approach progressively distills salient visual concepts from raw patch-tokens at different semantic levels, mitigating redundancy while preserving crucial details for alignment. This results in a more robust video representation, leading to new state-of-the-art performance on challenging benchmarks including MSRVTT, DiDeMo, and ActivityNet. Our work validates the effectiveness of exploiting hierarchical features for advancing video-text retrieval. Our codes are available at https://github.com/boyun-zhang/HVP-Net.

Zequn Xie, Boyun Zhang, Yuxiao Lin, Tao Jin• 2026

Related benchmarks

Task	Dataset	Result
Text-to-Video Retrieval	DiDeMo	R@10.571	465
Text-to-Video Retrieval	ActivityNet	R@10.435	245
Text-to-Video Retrieval	MSR-VTT (1k-A)	R@1090.6	211
Video-to-Text retrieval	MSR-VTT (1k-A)	Recall@580.5	74

Showing 4 of 4 rows

Other info

Follow for update

@wizwand_team Discord