Unleash the Potential of CLIP for Video Highlight Detection
About
Multimodal and large language models (LLMs) have revolutionized the utilization of open-world knowledge, unlocking new potential across various tasks and applications. Among these domains, the video domain has notably benefited from their capabilities. In this paper, we present Highlight-CLIP (HL-CLIP), a method designed to excel at video highlight detection by leveraging the pre-trained knowledge embedded in multimodal models. By simply fine-tuning the multimodal encoder in combination with our saliency pooling technique, we have achieved, to the best of our knowledge, state-of-the-art performance on the QVHighlights benchmark for highlight detection.
Donghoon Han, Seunghyeon Seo, Eunhwan Park, Seong-Uk Nam, Nojun Kwak • 2024
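The paper's code is not reproduced here, but the two ingredients named above (a fine-tuned CLIP-style encoder and saliency pooling) can be illustrated with a short PyTorch sketch. Everything below is an assumption for illustration: `SaliencyHead`, `saliency_pool`, and the softmax-weighted pooling with a `temperature` parameter are hypothetical stand-ins, not the authors' implementation.

```python
# Minimal sketch of the HL-CLIP idea, not the authors' code: score per-frame
# CLIP-style features against a text query, then pool the saliency scores.
# The softmax-weighted pooling below is an assumed form of "saliency pooling".
import torch
import torch.nn as nn
import torch.nn.functional as F


class SaliencyHead(nn.Module):
    """Hypothetical lightweight head scoring frame embeddings against a query."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # fine-tunable projection on top of CLIP

    def forward(self, frame_emb: torch.Tensor, query_emb: torch.Tensor) -> torch.Tensor:
        # frame_emb: (T, D) per-frame visual features; query_emb: (D,) text feature
        frames = F.normalize(self.proj(frame_emb), dim=-1)
        query = F.normalize(query_emb, dim=-1)
        return frames @ query  # (T,) cosine-similarity saliency scores


def saliency_pool(scores: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Softmax-weighted pooling: emphasizes the highest-saliency frames."""
    weights = torch.softmax(scores / temperature, dim=0)
    return (weights * scores).sum()


if __name__ == "__main__":
    T, D = 75, 512                 # e.g. 2-second clips from a 150 s video, ViT-B/32 width
    frame_emb = torch.randn(T, D)  # stand-in for CLIP visual features
    query_emb = torch.randn(D)     # stand-in for CLIP text features
    head = SaliencyHead(D)
    scores = head(frame_emb, query_emb)
    print(saliency_pool(scores))   # scalar highlight score for the video
```

In this reading, the pooled score aggregates frame-level saliency so that training signals concentrate on the most query-relevant frames; a plain mean over frames would dilute them.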
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Moment Retrieval | QVHighlights (test) | -- | 170 |
| Highlight Detection | QVHighlights (test) | HIT@1: 70.6 | 151 |
| Moment Retrieval | QVHighlights (val) | -- | 53 |
| Highlight Detection | QVHighlights (val) | HIT@1: 72.4 | 35 |