RLIPv2: Fast Scaling of Relational Language-Image Pre-training

About

Relational Language-Image Pre-training (RLIP) aims to align vision representations with relational texts, thereby advancing the capability of relational reasoning in computer vision tasks. However, scaling RLIPv1 is challenging, hindered by the slow convergence of the RLIPv1 architecture and the limited availability of existing scene graph data. In this paper, we propose RLIPv2, a fast-converging model that enables the scaling of relational pre-training to large-scale pseudo-labelled scene graph data. To enable fast scaling, RLIPv2 introduces Asymmetric Language-Image Fusion (ALIF), a mechanism that facilitates earlier and deeper gated cross-modal fusion with sparsified language encoding layers. ALIF leads to comparable or better performance than RLIPv1 in a fraction of the time for pre-training and fine-tuning. To obtain scene graph data at scale, we extend object detection datasets with free-form relation labels by introducing a captioner (e.g., BLIP) and a designed Relation Tagger. The Relation Tagger assigns BLIP-generated relation texts to region pairs, thus enabling larger-scale relational pre-training. Through extensive experiments conducted on Human-Object Interaction Detection and Scene Graph Generation, RLIPv2 shows state-of-the-art performance on three benchmarks under fully fine-tuned, few-shot and zero-shot settings. Notably, the largest RLIPv2 achieves 23.29 mAP on HICO-DET without any fine-tuning, 32.22 mAP with just 1% of the data, and 45.09 mAP with 100% of the data. Code and models are publicly available at https://github.com/JacobYuan7/RLIPv2.
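To make the phrase "gated cross-modal fusion" concrete, here is a minimal sketch in plain Python. It is not the RLIPv2 implementation: the function names (`cross_attend`, `gated_fusion`), the single-head attention without learned projections, and the `tanh` gate on a scalar `gate_param` are all assumptions chosen for illustration, and the asymmetric and sparsified aspects of ALIF are not modelled. The idea shown is only that visual tokens attend over text tokens and the result is added back through a gate, so a gate at zero leaves the visual features unchanged.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Single-head scaled dot-product attention with no learned projections.

    Each query vector (e.g. a visual token) attends over keys_values
    (e.g. text tokens), which serve as both keys and values.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * kv[j] for w, kv in zip(weights, keys_values))
             for j in range(dim)]
        )
    return out


def gated_fusion(visual, text, gate_param):
    """Residual update: visual + tanh(gate_param) * attention(visual -> text).

    gate_param is a hypothetical learnable scalar; tanh(0) = 0 means the
    text branch contributes nothing until the gate opens.
    """
    gate = math.tanh(gate_param)
    attended = cross_attend(visual, text)
    return [
        [v + gate * a for v, a in zip(v_row, a_row)]
        for v_row, a_row in zip(visual, attended)
    ]


if __name__ == "__main__":
    visual = [[1.0, 0.0], [0.0, 1.0]]             # two toy visual tokens
    text = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]  # three toy text tokens
    print(gated_fusion(visual, text, 0.0))  # gate closed: equals visual
    print(gated_fusion(visual, text, 1.0))  # gate open: text mixed in
```

In a trained model the gate and the attention projections would be learned parameters; the toy vectors here exist only to show the data flow.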

Hangjie Yuan, Shiwei Zhang, Xiang Wang, Samuel Albanie, Yining Pan, Tao Feng, Jianwen Jiang, Dong Ni, Yingya Zhang, Deli Zhao • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Human-Object Interaction Detection | HICO-DET (test) | mAP (full) | 45.1 | 493 |
| Human-Object Interaction Detection | V-COCO (test) | AP (Role, Scenario 1) | 72.1 | 270 |
| Human-Object Interaction Detection | HICO-DET | mAP (Full) | 45.09 | 233 |
| Scene Graph Generation | Open Images v6 (test) | wmAPrel | 56.38 | 74 |
| Human-Object Interaction Detection | V-COCO | AP^1 Role | 72.1 | 65 |
| HOI Detection | V-COCO | AP Role 1 | 72.1 | 40 |
| HOI Detection | HICO-DET | mAP (Rare) | 43.23 | 34 |
| Human-Object Interaction Detection | HICO-DET 1 (test) | Full mAP | 32.22 | 33 |
| HOI Detection | HICO-DET (test) | Box mAP (Full) | 45.1 | 32 |
| Human-Object Interaction Detection | V-COCO | Box mAP (Scenario 1) | 72.1 | 32 |
Showing 10 of 20 rows
