
Hierarchical Memory for Long Video QA

About

This paper describes our winning solution to the LOVEU Challenge @ CVPR'24, Track 1 (Long Video VQA). Long video question answering is challenging because processing long sequences of visual tokens is computationally expensive and memory-intensive. The key is to compress visual tokens effectively, reducing the memory footprint and decoding latency while preserving the information needed to answer questions accurately. We adopt STAR Memory, a hierarchical memory mechanism proposed in Flash-VStream, which can process long videos with limited GPU memory (VRAM). We further use the video and audio data of the MovieChat-1K training set to fine-tune the pretrained weights released by Flash-VStream, achieving 1st place in the challenge. Code is available at the project homepage: https://invinciblewyq.github.io/vstream-page .
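
To make the idea of compressing visual tokens under a fixed memory budget concrete, here is a minimal sketch. It is not the STAR Memory design, whose details are not given in this abstract. It assumes each frame is already encoded as a feature vector and uses one simple, generic strategy: when the memory exceeds its capacity, merge the two most similar adjacent entries by weighted averaging. The class name `BoundedFrameMemory` and its methods are hypothetical names chosen for illustration.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class BoundedFrameMemory:
    """Illustrative fixed-capacity memory (an assumption, not the paper's method).

    Each slot holds (feature, count), where count is the number of original
    frames that have been averaged into that slot.
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.slots = []

    def add(self, feature):
        """Append one frame feature, then compress back down to capacity."""
        self.slots.append((list(feature), 1))
        while len(self.slots) > self.capacity:
            self._merge_most_similar_adjacent_pair()

    def _merge_most_similar_adjacent_pair(self):
        # Find the temporally adjacent pair with the highest cosine similarity.
        best = max(
            range(len(self.slots) - 1),
            key=lambda i: cosine(self.slots[i][0], self.slots[i + 1][0]),
        )
        (feat_a, count_a), (feat_b, count_b) = self.slots[best], self.slots[best + 1]
        total = count_a + count_b
        # Count-weighted average keeps the merged slot equal to the mean
        # of all original frames it represents.
        merged = [(x * count_a + y * count_b) / total for x, y in zip(feat_a, feat_b)]
        self.slots[best:best + 2] = [(merged, total)]


memory = BoundedFrameMemory(capacity=2)
for frame in ([1, 0], [1, 0], [0, 1], [0, 1]):
    memory.add(frame)
print(memory.slots)
```

In this sketch the number of stored entries never exceeds `capacity` regardless of video length, which is the property that keeps memory use bounded; the trade-off is that merged entries lose fine temporal detail.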

Yiqin Wang, Haoji Zhang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, Xiaojie Jin • 2024

Related benchmarks

Task | Dataset | Result | Rank
Long Video Question Answering | MovieChat-1K Global Mode (test) | Accuracy: 84 | 24
Long Video Question Answering | MovieChat-1K Breakpoint Mode (test) | Accuracy: 73.5 | 24