
GRiT: A Generative Region-to-text Transformer for Object Understanding

About

This paper presents a Generative RegIon-to-Text transformer, GRiT, for object understanding. The spirit of GRiT is to formulate object understanding as <region, text> pairs, where region locates objects and text describes objects. For example, the text in object detection denotes class names while that in dense captioning refers to descriptive sentences. Specifically, GRiT consists of a visual encoder to extract image features, a foreground object extractor to localize objects, and a text decoder to generate open-set object descriptions. With the same model architecture, GRiT can understand objects via not only simple nouns, but also rich descriptive sentences including object attributes or actions. Experimentally, we apply GRiT to object detection and dense captioning tasks. GRiT achieves 60.4 AP on COCO 2017 test-dev for object detection and 15.5 mAP on Visual Genome for dense captioning. Code is available at https://github.com/JialianW/GRiT
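The abstract's three-stage pipeline (visual encoder → foreground object extractor → text decoder emitting a description per region) can be sketched as plain Python with mock components. All class and function names here (`RegionText`, `grit_forward`, the mock modules) are illustrative assumptions, not the authors' actual implementation; see the linked repository for the real code.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class RegionText:
    """The <region, text> pair GRiT produces for each object:
    the box localizes the object, the text describes it (a class
    name for detection, a sentence for dense captioning)."""
    box: Box
    text: str

def grit_forward(
    image,
    encoder: Callable,    # image -> features
    extractor: Callable,  # features -> list of boxes
    decoder: Callable,    # (features, box) -> open-set text
) -> List[RegionText]:
    """One forward pass under the <region, text> formulation."""
    feats = encoder(image)
    boxes = extractor(feats)
    return [RegionText(box=b, text=decoder(feats, b)) for b in boxes]

# Mock stand-ins so the sketch runs without a trained model.
def mock_encoder(image):
    return {"features": image}

def mock_extractor(feats) -> List[Box]:
    return [(10.0, 20.0, 110.0, 220.0)]

def mock_decoder(feats, box) -> str:
    return "a person riding a bike"

pairs = grit_forward("raw-image", mock_encoder, mock_extractor, mock_decoder)
```

The point of the sketch is that detection and dense captioning differ only in what the decoder emits, not in the pipeline's structure.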

Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, Lijuan Wang • 2022

Related benchmarks

Task                           | Dataset              | Metric | Result | Rank
Object Detection               | COCO 2017 (val)      | AP     | 60.3   | 2454
Object Detection               | COCO 2017 (test-dev) | mAP    | 60.4   | 499
Object Detection               | COCO                 | mAP    | 60.4   | 107
Object Detection               | COCO-O               | AP     | 42.9   | 35
Referring expression generation| RefCOCOg (val)       | METEOR | 15.3   | 31
Region-level captioning        | RefCOCOg             | METEOR | 15.2   | 21
Region-level captioning        | RefCOCOg (test)      | CIDEr  | 71.6   | 18
Region Captioning              | Visual Genome        | METEOR | 17.1   | 18
Dense Captioning               | Visual Genome        | mAP    | 15.52  | 16
Dense Captioning               | VG V1.0              | mAP    | 0.4    | 16

Showing 10 of 22 rows

Other info

Code: https://github.com/JialianW/GRiT
