Hierarchical Memory for Long Video QA

About

This paper describes our winning solution to the LOVEU Challenge @ CVPR'24, Track 1 (Long Video VQA). Processing long sequences of visual tokens is computationally expensive and memory-intensive, which makes long video question answering a challenging task. The key is to compress visual tokens effectively, reducing memory footprint and decoding latency while preserving the information essential for accurate question answering. We adopt a hierarchical memory mechanism named STAR Memory, proposed in Flash-VStream, that can process long videos with limited GPU memory (VRAM). We further use the video and audio data of the MovieChat-1K training set to fine-tune the pretrained weights released by Flash-VStream, achieving 1st place in the challenge. Code is available at the project homepage: https://invinciblewyq.github.io/vstream-page
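
To make the idea of compressing visual tokens under a fixed memory budget concrete, here is a minimal sketch. It is not STAR Memory and not the Flash-VStream implementation, whose design the abstract does not detail. It assumes each frame is already encoded as a small feature vector, and it keeps memory bounded by merging the two most similar temporally adjacent entries whenever the budget is exceeded. The class name `BoundedFrameMemory` and the helper `cosine` are hypothetical names chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class BoundedFrameMemory:
    """Illustrative fixed-capacity memory of frame features (hypothetical, not STAR Memory)."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        # Each slot is (feature vector, number of frames merged into it).
        self.slots = []

    def add(self, feature):
        """Append one frame feature, then compress until within capacity."""
        self.slots.append((list(feature), 1))
        while len(self.slots) > self.capacity:
            self._merge_most_similar_neighbors()

    def _merge_most_similar_neighbors(self):
        # Pick the adjacent pair with the highest cosine similarity.
        best = max(
            range(len(self.slots) - 1),
            key=lambda i: cosine(self.slots[i][0], self.slots[i + 1][0]),
        )
        (feat_a, w_a), (feat_b, w_b) = self.slots[best], self.slots[best + 1]
        # Weighted average, so each original frame contributes equally.
        merged = [(x * w_a + y * w_b) / (w_a + w_b) for x, y in zip(feat_a, feat_b)]
        self.slots[best:best + 2] = [(merged, w_a + w_b)]


memory = BoundedFrameMemory(capacity=2)
for frame_feature in ([1, 0], [1, 0], [0, 1], [0, 1]):
    memory.add(frame_feature)
```

In this sketch memory use depends on `capacity` rather than on video length, which is the property the abstract asks of a compression scheme. What information survives depends entirely on the merge rule, and the rule above is only one simple choice.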

Yiqin Wang, Haoji Zhang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, Xiaojie Jin • 2024

Related benchmarks

Task	Dataset	Metric	Result	Rank
Long Video Question Answering	MovieChat-1K Global Mode (test)	Accuracy	84	24
Long Video Question Answering	MovieChat-1K Breakpoint Mode (test)	Accuracy	73.5	24
