
Large Language Models are Temporal and Causal Reasoners for Video Question Answering

About

Large Language Models (LLMs) have shown remarkable performance on a wide range of natural language understanding and generation tasks. We observe that LLMs provide effective priors in exploiting $\textit{linguistic shortcuts}$ for temporal and causal reasoning in Video Question Answering (VideoQA). However, such priors often cause suboptimal results on VideoQA by leading the model to over-rely on questions, $\textit{i.e.}$, $\textit{linguistic bias}$, while ignoring visual content. This is also known as `ungrounded guesses' or `hallucinations'. To address this problem while leveraging LLMs' prior on VideoQA, we propose a novel framework, Flipped-VQA, encouraging the model to predict all combinations of the $\langle$V, Q, A$\rangle$ triplet by flipping the source pair and the target label to understand their complex relationships, $\textit{i.e.}$, predict A, Q, and V given VQ, VA, and QA pairs, respectively. In this paper, we develop LLaMA-VQA by applying Flipped-VQA to LLaMA, and it outperforms both LLM-based and non-LLM-based models on five challenging VideoQA benchmarks. Furthermore, our Flipped-VQA is a general framework that is applicable to various LLMs (OPT and GPT-J) and consistently improves their performance. We empirically demonstrate that Flipped-VQA not only enhances the exploitation of linguistic shortcuts but also mitigates the linguistic bias, which causes incorrect answers through over-reliance on the question. Code is available at https://github.com/mlvlab/Flipped-VQA.
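
The flipping idea in the abstract can be illustrated with a minimal sketch. It only enumerates the three source-pair/target combinations for one triplet and is not the authors' implementation. The names `Triplet`, `flipped_tasks`, and `combined_loss` are hypothetical, and the unweighted sum of the three losses is an assumption made for illustration; the abstract does not say how the objectives are combined.

```python
from typing import Dict, List, NamedTuple, Tuple


class Triplet(NamedTuple):
    """One <V, Q, A> example. Strings stand in for real video features and text."""
    video: str
    question: str
    answer: str


def flipped_tasks(t: Triplet) -> List[Tuple[str, Tuple[str, str], str]]:
    """Return (task name, source pair, target) for each flipped objective."""
    return [
        ("VQ->A", (t.video, t.question), t.answer),    # standard VideoQA direction
        ("VA->Q", (t.video, t.answer), t.question),    # predict the question
        ("QA->V", (t.question, t.answer), t.video),    # predict the video
    ]


def combined_loss(losses: Dict[str, float]) -> float:
    """Assumption: combine the three per-task losses with an unweighted sum."""
    return sum(losses[name] for name in ("VQ->A", "VA->Q", "QA->V"))


if __name__ == "__main__":
    example = Triplet("clip_001", "What does the man do after opening the door?", "He walks in")
    for name, source, target in flipped_tasks(example):
        print(name, source, "->", target)
```

In an actual model, each target would be scored by the LLM conditioned on its source pair. The sketch only makes the three directions explicit.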

Dohwan Ko, Ji Soo Lee, Wooyoung Kang, Byungseok Roh, Hyunwoo J. Kim • 2023

Related benchmarks

Task | Dataset | Result | Rank
Video Question Answering | ActivityNet-QA | Accuracy 48.6 | 319
Video Question Answering | NExT-QA (test) | Accuracy 72 | 204
Video Question Answering | NExT-QA (val) | Overall Acc 72 | 176
Video Question Answering | NEXT-QA | Overall Accuracy 75.5 | 105
Video Question Answering | NExT-QA Main Dataset | Accuracy 0.755 | 48
Video Question Answering | TVQA | Accuracy 82.2 | 40
Video Question Answering | Next-QA v1 (test) | Overall Acc 72 | 24
Video Question Answering | VLEP | Total Accuracy 71 | 8
Video Question Answering | NEXT-QA open-form generation | WUPS 34.3 | 5
Video Question Answering | DramaQA | Total Accuracy 84.1 | 5
