# VRAG: Region Attention Graphs for Content-Based Video Retrieval

## About
Content-based Video Retrieval (CBVR) is used on media-sharing platforms for applications such as video recommendation and filtering. To manage databases that scale to billions of videos, video-level approaches that use fixed-size embeddings are preferred for their efficiency. In this paper, we introduce Video Region Attention Graph Networks (VRAG), which improve the state of the art among video-level methods. We represent videos at a finer granularity via region-level features and encode video spatio-temporal dynamics through region-level relations. VRAG captures the relationships between regions based on their semantic content via self-attention and the permutation-invariant aggregation of graph convolution. In addition, we show that the performance gap between video-level and frame-level methods can be narrowed by segmenting videos into shots and using shot embeddings for retrieval. We evaluate VRAG on several video retrieval tasks and achieve a new state of the art for video-level retrieval. Furthermore, our shot-level VRAG attains higher retrieval precision than other video-level methods and approaches frame-level performance at faster evaluation speeds. Finally, our code will be made publicly available.
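The two core operations the abstract names, relating regions by semantic content via self-attention and reducing them with a permutation-invariant aggregation, can be sketched in numpy. This is a minimal illustration, not the paper's implementation: the projection matrices `wq`/`wk`/`wv`, the single attention head, and the mean readout are all simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def region_self_attention(regions, wq, wk, wv):
    """Relate regions by semantic content via scaled dot-product self-attention.

    regions: (n_regions, d) feature matrix; wq/wk/wv: (d, d) projections.
    (Single-head attention is an illustrative simplification.)
    """
    q, k, v = regions @ wq, regions @ wk, regions @ wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (n, n) region-to-region weights
    return attn @ v  # each region becomes a content-weighted mix of all regions

def aggregate(region_states):
    # Permutation-invariant readout: a mean over regions yields a fixed-size
    # embedding regardless of region order or count.
    return region_states.mean(axis=0)

rng = np.random.default_rng(0)
d = 8
regions = rng.normal(size=(5, d))          # 5 region features from one video
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
emb = aggregate(region_self_attention(regions, wq, wk, wv))

# Invariance check: shuffling region order leaves the embedding unchanged,
# since attention is permutation-equivariant and the mean is invariant.
perm = rng.permutation(5)
emb_shuffled = aggregate(region_self_attention(regions[perm], wq, wk, wv))
assert np.allclose(emb, emb_shuffled)
```

The invariance check at the end is the point of the mean readout: the fixed-size video embedding does not depend on how regions happen to be ordered.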
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Event Video Retrieval | EVVE | mAP | 65.3 | 52 |
| Video Retrieval | FIVR-200K | DSVR | 0.484 | 45 |
| Near-duplicate Video Retrieval | CC_WEB_VIDEO Standard | mAP | 97.6 | 26 |
| Near-duplicate Video Retrieval | CC_WEB_VIDEO | -- | -- | 25 |
| Near-duplicate Video Retrieval | CC_WEB_VIDEO Cleaned | mAP | 98.7 | 24 |
| Near-duplicate Video Retrieval | CC_WEB_VIDEO original and cleaned (full dataset) | CC Score | 97.1 | 17 |
| Video Retrieval | FIVR 5K (query) | Inference Time (s/q) | 0.79 | 15 |
| Video Retrieval | FIVR-200K original (test) | DSVR | 0.723 | 12 |
| Event Video Retrieval | FIVR5K | DSVR | 70.9 | 10 |
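The shot-level retrieval mentioned in the abstract can be sketched as scoring each database video against a query via its shot embeddings. The max-over-shot-pairs cosine similarity below is an illustrative scoring choice, not necessarily the one used in the paper:

```python
import numpy as np

def cosine_sim(a, b):
    # pairwise cosine similarity between rows of a and rows of b
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def video_similarity(query_shots, db_shots):
    """Score a database video against a query from their shot embeddings.

    Taking the max over all query/database shot pairs is an assumption made
    for illustration; any pooling over the similarity matrix would do.
    """
    return cosine_sim(query_shots, db_shots).max()

rng = np.random.default_rng(1)
query = rng.normal(size=(3, 8))                      # 3 shot embeddings for the query
database = [rng.normal(size=(k, 8)) for k in (2, 4, 3)]  # videos with varying shot counts
scores = [video_similarity(query, v) for v in database]
ranking = np.argsort(scores)[::-1]                   # database videos ordered by relevance
```

Because each video contributes only a handful of shot embeddings rather than one embedding per frame, this keeps retrieval close to video-level cost while recovering some frame-level precision, which is the trade-off the abstract describes.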