
Learning Models for Actions and Person-Object Interactions with Transfer to Question Answering

About

This paper proposes deep convolutional network models that utilize local and global context to make human activity label predictions in still images, achieving state-of-the-art performance on two recent datasets with hundreds of labels each. We use multiple instance learning to handle the lack of supervision on the level of individual person instances, and weighted loss to handle unbalanced training data. Further, we show how specialized features trained on these datasets can be used to improve accuracy on the Visual Question Answering (VQA) task, in the form of multiple choice fill-in-the-blank questions (Visual Madlibs). Specifically, we tackle two types of questions on person activity and person-object relationship and show improvements over generic features trained on the ImageNet classification task.
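The two training devices named above can be sketched concretely. Under multiple instance learning, only image-level activity labels exist, so the image score for a class is pooled (e.g. by max) over per-person instance scores; a weighted loss then up-weights rare positive labels. The snippet below is an illustrative sketch of this idea under assumed choices (max pooling, sigmoid outputs, a hypothetical `pos_weight` factor), not the paper's actual implementation:

```python
import numpy as np

def mil_weighted_loss(instance_scores, labels, pos_weight=10.0):
    """Illustrative MIL + weighted loss (a sketch, not the paper's code).

    instance_scores: (num_persons, num_classes) raw scores, one row per
                     detected person instance in the image.
    labels:          (num_classes,) binary image-level activity labels.
    pos_weight:      hypothetical up-weighting of positives to counter
                     unbalanced training data.
    """
    # MIL pooling: supervision exists only at the image level, so take
    # the max over person instances as the image-level class score.
    image_scores = instance_scores.max(axis=0)       # (num_classes,)
    probs = 1.0 / (1.0 + np.exp(-image_scores))      # sigmoid

    eps = 1e-12
    # Weighted binary cross-entropy: positive (rare) labels count more.
    loss = -(pos_weight * labels * np.log(probs + eps)
             + (1 - labels) * np.log(1 - probs + eps))
    return loss.mean()
```

With max pooling, an image-level positive label is explained as long as at least one person instance scores highly, which is exactly the situation when per-instance labels are unavailable.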

Arun Mallya, Svetlana Lazebnik • 2016

Related benchmarks

Task                               | Dataset                      | Result         | Rank
Human-Object Interaction Detection | HICO-DET (test)              | -              | 493
Human-Object Interaction Detection | HICO-DET                     | -              | 233
Activity Recognition               | MPII (test)                  | mAP 32.24      | 20
Multi-label HOI classification     | HICO                         | mAP 36.1       | 10
HOI Classification                 | HICO (test)                  | mAP 36.1       | 10
Activity Prediction                | HICO (test)                  | mAP 36.1       | 9
Pair's Relationship                | MadLibs Easy (test)          | Accuracy 78.5  | 7
Pair's Relationship                | MadLibs Hard (test)          | Accuracy 56.17 | 7
Pair's Relationship                | MadLibs Filtered Hard (test) | Accuracy 62.06 | 7
Person's Activity                  | MadLibs Easy (test)          | Accuracy 87.57 | 7

(Showing 10 of 12 rows)
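The mAP figures in the table are mean average precision: per-class average precision from ranking images by predicted score, averaged over classes. A minimal sketch of that metric under the standard definition (assumed here; this is not the benchmarks' official evaluation code):

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one class: mean precision at the ranks of true positives,
    with images sorted by decreasing predicted score."""
    order = np.argsort(-scores)
    labels = labels[order]
    tp = np.cumsum(labels)
    precision = tp / np.arange(1, len(labels) + 1)
    return (precision * labels).sum() / max(labels.sum(), 1)

def mean_ap(score_matrix, label_matrix):
    """mAP: average of per-class AP over all classes.

    score_matrix, label_matrix: (num_images, num_classes).
    """
    aps = [average_precision(score_matrix[:, c], label_matrix[:, c])
           for c in range(score_matrix.shape[1])]
    return float(np.mean(aps))
```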
