Hierarchical Conditional Relation Networks for Video Question Answering

About

Video question answering (VideoQA) is challenging as it requires modeling capacity to distill dynamic visual artifacts and distant relations and to associate them with linguistic concepts. We introduce a general-purpose reusable neural unit called Conditional Relation Network (CRN) that serves as a building block to construct more sophisticated structures for representation and reasoning over video. CRN takes as input an array of tensorial objects and a conditioning feature, and computes an array of encoded output objects. Model building becomes a simple exercise of replication, rearrangement and stacking of these reusable units for diverse modalities and contextual information. This design thus supports high-order relational and multi-step reasoning. The resulting architecture for VideoQA is a CRN hierarchy whose branches represent sub-videos or clips, all sharing the same question as the contextual condition. Our evaluations on well-known datasets achieved new SoTA results, demonstrating the impact of building a general-purpose reasoning unit on complex domains such as VideoQA.
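The input/output contract of the unit described above (an array of objects plus a conditioning feature in, an array of encoded objects out) can be sketched in plain Python. This is an illustrative sketch, not the authors' implementation: the names `crn_unit`, `mean_pool`, and `condition` are hypothetical, the element-wise product stands in for learned conditioning layers, and the use of sampled subsets with mean aggregation is an assumption about how relations among inputs might be encoded.

```python
import itertools
import random
from typing import List, Sequence

Vector = List[float]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def condition(obj: Vector, cond: Vector) -> Vector:
    """Hypothetical stand-in for a learned conditioning layer (element-wise product)."""
    return [o * c for o, c in zip(obj, cond)]


def crn_unit(objects: Sequence[Vector], cond: Vector,
             max_subset_size: int, samples_per_size: int = 2,
             seed: int = 0) -> List[Vector]:
    """Encode relations among `objects` under the conditioning feature `cond`.

    Assumed scheme: for each subset size k, sample a few k-subsets,
    pool each subset, condition it, then average over the sampled subsets.
    Returns one encoded object per subset size.
    """
    rng = random.Random(seed)
    outputs = []
    largest = min(max_subset_size, len(objects))
    for k in range(2, largest + 1):
        subsets = list(itertools.combinations(objects, k))
        chosen = rng.sample(subsets, min(samples_per_size, len(subsets)))
        fused = [condition(mean_pool(subset), cond) for subset in chosen]
        outputs.append(mean_pool(fused))
    return outputs


# Toy hierarchy: clip-level units feed a video-level unit,
# and every unit shares the same question vector as its condition.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
question = [1.0, 0.5]
clips = [frames[:2], frames[2:]]
clip_codes = [mean_pool(crn_unit(clip, question, max_subset_size=2))
              for clip in clips]
video_codes = crn_unit(clip_codes, question, max_subset_size=2)
```

Because each unit returns the same kind of object it consumes, the clip-level outputs can be fed directly into another unit at the video level, which is the replication-and-stacking property the abstract emphasizes.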

Thao Minh Le, Vuong Le, Svetha Venkatesh, Truyen Tran • 2020

Related benchmarks

Task | Dataset | Result | Rank
Video Question Answering | MSRVTT-QA | Accuracy 36.1 | 481
Video Question Answering | MSRVTT-QA (test) | Accuracy 35.6 | 371
Video Question Answering | MSVD-QA | Accuracy 36.1 | 340
Video Question Answering | MSVD-QA (test) | -- | 274
Video Question Answering | NExT-QA (test) | Accuracy 48.89 | 204
Video Question Answering | NExT-QA (val) | Overall Acc 48.2 | 176
Video Question Answering | TGIF-QA | Accuracy 75 | 147
Audio-Visual Question Answering | MUSIC-AVQA 1.0 (test) | AV Localis Accuracy 61.81 | 96
Video Question Answering | TGIF-QA (test) | Accuracy 81.4 | 89
Text-to-Video Retrieval | MSRVTT 1k (test) | Recall@10 66.6 | 63
