No-Frills Human-Object Interaction Detection: Factorization, Layout Encodings, and Training Techniques

About

We show that for human-object interaction detection a relatively simple factorized model with appearance and layout encodings constructed from pre-trained object detectors outperforms more sophisticated approaches. Our model includes factors for detection scores, human and object appearance, and coarse (box-pair configuration) and optionally fine-grained layout (human pose). We also develop training techniques that improve learning efficiency by: (1) eliminating a train-inference mismatch; (2) rejecting easy negatives during mini-batch training; and (3) using a ratio of negatives to positives that is two orders of magnitude larger than existing approaches. We conduct a thorough ablation study to understand the importance of different factors and training techniques using the challenging HICO-Det dataset.
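As a rough illustration of what a factorized model of this kind can look like, the sketch below combines detector confidences with per-factor interaction scores. The combination rule (summing factor logits, applying a sigmoid, and multiplying by the two detection scores), the function name `hoi_score`, and all numeric values are assumptions made for illustration. They are not taken from the abstract, which names the factors but does not say how they are combined.

```python
import math
from typing import Mapping


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def hoi_score(human_det: float, object_det: float,
              factor_logits: Mapping[str, float]) -> float:
    """Hypothetical factorized score for one human-object-interaction triplet.

    Assumption: per-factor logits are summed, squashed with a sigmoid, and
    multiplied by the human and object detection confidences.
    """
    interaction_prob = sigmoid(sum(factor_logits.values()))
    return human_det * object_det * interaction_prob


# Made-up logits for the factors named in the abstract.
factor_logits = {
    "human_appearance": 1.2,
    "object_appearance": 0.4,
    "box_pair_layout": 0.9,   # coarse layout
    "human_pose": -0.5,       # fine-grained layout (optional factor)
}
score = hoi_score(0.9, 0.8, factor_logits)

# Dropping the optional pose factor only removes one term from the sum.
coarse_only = {k: v for k, v in factor_logits.items() if k != "human_pose"}
score_without_pose = hoi_score(0.9, 0.8, coarse_only)
```

One convenience of an additive design like this is that factors can be ablated by removing their term, which fits the kind of ablation study the abstract describes.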

Tanmay Gupta, Alexander Schwing, Derek Hoiem • 2018

Related benchmarks

Task	Dataset	Metric	Result
Human-Object Interaction Detection	HICO-DET (test)	mAP (full)	20.41	544
Human-Object Interaction Detection	V-COCO (test)	AP (Role, Scenario 1)	53.1	270
Human-Object Interaction Detection	HICO-DET	mAP (Full)	17.18	263
HOI Detection	HICO-DET (test)	Box mAP (Full)	17.18	32
Human-Object Interaction Detection	V-COCO	AP (Role)	31.8	23
Human-Object Interaction Detection	HICO-DET 9 (test)	mAP (Full)	17.18	21
Human-Object Interaction Detection	HOI-VP	mAP	61.05	11

