RLIPv2: Fast Scaling of Relational Language-Image Pre-training

About

Relational Language-Image Pre-training (RLIP) aims to align vision representations with relational texts, thereby advancing the capability of relational reasoning in computer vision tasks. However, hindered by the slow convergence of the RLIPv1 architecture and the limited availability of existing scene graph data, scaling RLIPv1 is challenging. In this paper, we propose RLIPv2, a fast-converging model that enables the scaling of relational pre-training to large-scale pseudo-labelled scene graph data. To enable fast scaling, RLIPv2 introduces Asymmetric Language-Image Fusion (ALIF), a mechanism that facilitates earlier and deeper gated cross-modal fusion with sparsified language encoding layers. ALIF leads to comparable or better performance than RLIPv1 in a fraction of the time for pre-training and fine-tuning. To obtain scene graph data at scale, we extend object detection datasets with free-form relation labels by introducing a captioner (e.g., BLIP) and a designed Relation Tagger. The Relation Tagger assigns BLIP-generated relation texts to region pairs, thus enabling larger-scale relational pre-training. Through extensive experiments conducted on Human-Object Interaction Detection and Scene Graph Generation, RLIPv2 shows state-of-the-art performance on three benchmarks under fully fine-tuned, few-shot and zero-shot settings. Notably, the largest RLIPv2 achieves 23.29 mAP on HICO-DET without any fine-tuning, 32.22 mAP with just 1% of the data, and 45.09 mAP with 100% of the data. Code and models are publicly available at https://github.com/JacobYuan7/RLIPv2.

Hangjie Yuan, Shiwei Zhang, Xiang Wang, Samuel Albanie, Yining Pan, Tao Feng, Jianwen Jiang, Dong Ni, Yingya Zhang, Deli Zhao • 2023
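
The abstract describes ALIF only at a high level, as gated cross-modal fusion. To make the idea of a gate concrete, here is a minimal NumPy sketch of generic tanh-gated cross-attention in which vision tokens attend to language tokens. It is an illustration under stated assumptions, not RLIPv2's implementation: the function name `gated_cross_attention`, the tensor shapes, the single attention head, and the tanh gate are all hypothetical choices made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(vision, language, w_q, w_k, w_v, gate):
    """Generic gated cross-modal fusion (assumed form, not the paper's code).

    vision:   (n_v, d) vision tokens, used as queries
    language: (n_l, d) language tokens, used as keys and values
    gate:     scalar; tanh(gate) scales how much language is mixed in
    """
    q = vision @ w_q
    k = language @ w_k
    v = language @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_v, n_l)
    fused = attn @ v                                # (n_v, d)
    # Residual connection: a gate of 0 leaves the vision tokens unchanged.
    return vision + np.tanh(gate) * fused

rng = np.random.default_rng(0)
d = 8
vision = rng.normal(size=(5, d))
language = rng.normal(size=(3, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

closed = gated_cross_attention(vision, language, w_q, w_k, w_v, gate=0.0)
opened = gated_cross_attention(vision, language, w_q, w_k, w_v, gate=1.0)
print(np.allclose(closed, vision))
print(opened.shape)
```

With the gate at zero the layer is an identity on the vision tokens, so a fusion layer of this kind can be added to a network without disturbing its existing features; opening the gate mixes in language-conditioned information.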

Related benchmarks

Task	Dataset	Metric	Result
Human-Object Interaction Detection	HICO-DET (test)	mAP (full)	45.1	544
Human-Object Interaction Detection	V-COCO (test)	AP (Role, Scenario 1)	72.1	270
Human-Object Interaction Detection	HICO-DET	mAP (Full)	45.09	263
Scene Graph Generation	Open Images v6 (test)	wmAPrel	56.38	74
Human-Object Interaction Detection	V-COCO	AP^1 Role	72.1	65
Human-Object Interaction Detection	V-COCO	AP Role (Scenario 1)	68.8	53
HOI Detection	V-COCO	AP Role 1	72.1	40
HOI Detection	HICO-DET	mAP (Rare)	43.23	34
Human-Object Interaction Detection	HICO-DET 1 (test)	Full mAP	32.22	33
HOI Detection	HICO-DET (test)	Box mAP (Full)	45.1	32

Showing 10 of 22 rows
