Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA

About

Long-form videos that span wide temporal intervals are highly redundant and contain multiple distinct events or entities that are often only loosely related. Therefore, in long-form video question answering (LVQA), all the information needed to generate a correct response is often contained within a small subset of frames. Recent work leverages large language models (LLMs) on LVQA benchmarks and achieves exceptional performance, while relying on vision language models (VLMs) to convert all visual content within videos into natural language. Such VLMs often independently caption a large number of frames uniformly sampled from long videos, which is inefficient and largely redundant. Motivated by this inefficiency, we propose LVNet, a modular and training-free framework featuring a novel Hierarchical Keyframe Selector (HKS) that efficiently selects a minimal set of informative frames tailored to each question. LVNet's modularity allows easy integration with existing approaches for more efficient LVQA. We achieve state-of-the-art performance among similarly configured models across four benchmark LVQA datasets: EgoSchema, NExT-QA, IntentQA, and VideoMME. The code can be found at https://github.com/jongwoopark7978/LVNet

Jongwoo Park, Kanchana Ranasinghe, Kumara Kahatapitiya, Wonjeong Ryu, Donghyun Kim, Michael S. Ryoo • 2024
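
The contrast drawn in the abstract, between captioning many uniformly sampled frames and captioning only a small question-relevant subset, can be illustrated with a minimal sketch. This is not LVNet's Hierarchical Keyframe Selector, whose internals are not described on this page. The function names `uniform_sample` and `select_keyframes`, and the per-frame `score` callable, are hypothetical and introduced only for illustration. In practice the score would come from some model that rates how relevant a frame is to the question.

```python
from typing import Callable, List, Sequence


def uniform_sample(num_frames: int, num_samples: int) -> List[int]:
    """Return evenly spaced frame indices (the baseline the abstract calls redundant)."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, num_frames)
    step = num_frames / num_samples
    # Take the midpoint of each of the num_samples equal-width segments.
    return [int(i * step + step / 2) for i in range(num_samples)]


def select_keyframes(
    candidates: Sequence[int],
    score: Callable[[int], float],
    k: int,
) -> List[int]:
    """Keep the k candidates with the highest relevance score, in temporal order.

    `score` is a hypothetical stand-in for any question-conditioned
    relevance measure; it is an assumption, not part of LVNet's API.
    """
    top_k = sorted(candidates, key=score, reverse=True)[:k]
    return sorted(top_k)


if __name__ == "__main__":
    candidates = uniform_sample(num_frames=100, num_samples=4)
    # Toy relevance: frames near index 60 are treated as most relevant.
    keyframes = select_keyframes(candidates, lambda i: -abs(i - 60), k=2)
    print(candidates, keyframes)
```

Only the frames returned by `select_keyframes` would then be passed to a captioning model, so the number of captioning calls depends on `k` and not on the length of the video.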

Related benchmarks

Task	Dataset	Result
Video Question Answering	EgoSchema (Full)	Accuracy 61.1	241
Video Question Answering	NExT-QA (test)	Accuracy 72.9	204
Video Question Answering	NExT-QA (val)	Overall Acc 72.9	176
Video Question Answering	EgoSchema subset	Accuracy 68.2	124
Video Question Answering	EgoSchema 500-question subset	Accuracy 68.2	50
Video Question Answering	NExT-QA Main Dataset	Accuracy 0.729	48
Video Question Answering	IntentQA	Accuracy (All) 71.7	35
Video Question Answering	EgoSchema 5031 videos (test)	Top-1 Accuracy 61.1	26
Video Question Answering	Next-QA v1 (test)	Overall Acc 72.9	24
Video Reasoning	EgoSchema (test)	Accuracy 58.8	23

Showing 10 of 12 rows
