Zero-Shot Video Question Answering with Procedural Programs

About

We propose to answer zero-shot questions about videos by generating short procedural programs that derive a final answer from solving a sequence of visual subtasks. We present Procedural Video Querying (ProViQ), which uses a large language model to generate such programs from an input question and an API of visual modules in the prompt, then executes them to obtain the output. Similar procedural approaches have recently proven successful for image question answering, but videos remain challenging: we provide ProViQ with modules intended for video understanding, allowing it to generalize to a wide variety of videos. This code generation framework also enables ProViQ to perform video tasks beyond question answering, such as multi-object tracking or basic video editing. ProViQ achieves state-of-the-art results on a diverse range of benchmarks, with improvements of up to 25% on short, long, open-ended, and multimodal video question-answering datasets. Our project page is at https://rccchoudhury.github.io/proviq2023.
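The generate-then-execute loop described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and not ProViQ's actual API: the module names `find` and `count_objects`, the `execute_command` entry point, the prompt layout, and `fake_llm`, which stands in for a real language model call by returning a fixed program. Frames are modelled as plain dictionaries rather than images.

```python
from typing import Callable, Dict, List

# Toy frame representation (assumption): each frame lists the objects it contains.
Frame = Dict[str, List[str]]

def find(video: List[Frame], obj: str) -> List[int]:
    """Return indices of frames in which `obj` appears."""
    return [i for i, frame in enumerate(video) if obj in frame["objects"]]

def count_objects(frame: Frame, obj: str) -> int:
    """Return how many instances of `obj` are in a single frame."""
    return frame["objects"].count(obj)

# Hypothetical module API exposed to the program generator.
MODULES: Dict[str, Callable] = {"find": find, "count_objects": count_objects}

def build_prompt(question: str) -> str:
    """Describe the available modules, then append the question."""
    api = "\n".join(f"{name}: {fn.__doc__}" for name, fn in MODULES.items())
    return f"API:\n{api}\n\nQuestion: {question}\nWrite execute_command(video):"

def fake_llm(prompt: str) -> str:
    """Placeholder for an LLM call; always returns the same short program."""
    return (
        "def execute_command(video):\n"
        "    frames = find(video, 'dog')\n"
        "    if not frames:\n"
        "        return 0\n"
        "    return max(count_objects(video[i], 'dog') for i in frames)\n"
    )

def answer(question: str, video: List[Frame]):
    """Generate a program for the question and run it against the video."""
    program = fake_llm(build_prompt(question))
    namespace = dict(MODULES)  # the generated code can only see the modules
    exec(program, namespace)   # unsafe for untrusted output; sandbox in practice
    return namespace["execute_command"](video)

video = [{"objects": ["cat"]}, {"objects": ["dog", "dog"]}, {"objects": ["dog"]}]
result = answer("What is the largest number of dogs in any one frame?", video)
print(result)
```

Because the answer is produced by composing modules rather than by a single end-to-end model, a different generated program over the same modules could in principle serve a different task, which is the property the abstract attributes to the code generation framework.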

Rohan Choudhury, Koichiro Niinuma, Kris M. Kitani, László A. Jeni • 2023

Related benchmarks

Task | Dataset | Result | Rank
Video Question Answering | MSRVTT-QA (test) | Accuracy 22.1 | 371
Video Question Answering | ActivityNet-QA (test) | Accuracy 42.3 | 275
Video Question Answering | MSVD-QA (test) | Accuracy 37.5 | 274
Video Question Answering | NExT-QA (test) | Accuracy 63.8 | 204
Video Question Answering | EgoSchema (Full) | Accuracy 57.1 | 193
Video Question Answering | NExT-QA (val) | Overall Acc 64.6 | 176
Video Question Answering | TGIF-QA (test) | Accuracy 66.1 | 89
Video Question Answering | EgoSchema (test) | Accuracy 57.1 | 80
Video Question Answering | EgoSchema subset | Accuracy 57.1 | 73
Video Question Answering | iVQA (test) | Accuracy 50.7 | 31
Showing 10 of 13 rows
