
Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds

About

Most models tasked with grounding referential utterances in 2D and 3D scenes learn to select the referred object from a pool of object proposals provided by a pre-trained detector. This is limiting because an utterance may refer to visual entities at various levels of granularity, such as the chair, the leg of the chair, or the tip of the front leg of the chair, which may be missed by the detector. We propose a language grounding model that attends to the referential utterance and to the object proposal pool computed from a pre-trained detector, and decodes referenced objects with a detection head rather than selecting them from the pool. In this way, it benefits from powerful pre-trained object detectors without being restricted by their misses. We call our model Bottom Up Top Down DEtection TRansformers (BUTD-DETR) because it uses both language guidance (top-down) and objectness guidance (bottom-up) to ground referential utterances in images and point clouds. Moreover, BUTD-DETR casts object detection as referential grounding and uses object labels as language prompts to be grounded in the visual scene, augmenting supervision for the referential grounding task. The proposed model sets a new state of the art across popular 3D language grounding benchmarks, with significant performance gains over previous 3D approaches (12.6% on SR3D, 11.6% on NR3D and 6.3% on ScanRefer). When applied to 2D images, it performs on par with the previous state of the art. We ablate the design choices of our model and quantify their contribution to performance. Our code and checkpoints can be found at the project website https://butd-detr.github.io.
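To make the core idea concrete, here is a minimal, illustrative PyTorch sketch (not the authors' code; all module names, dimensions, and heads are hypothetical toy values): a set of learned queries cross-attends to a concatenation of language tokens (top-down stream) and detector proposal features (bottom-up stream), and a detection head then regresses boxes directly instead of classifying over the proposal pool.

```python
import torch
import torch.nn as nn

class ButdDetrSketch(nn.Module):
    """Toy sketch of the BUTD-DETR idea: decode boxes from learned queries
    that attend to both language tokens and detector proposals, rather than
    selecting one box from the proposal pool."""

    def __init__(self, d=64, n_queries=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d))
        # One cross-attention layer over the concatenated streams
        # (the real model stacks several encoder/decoder layers).
        self.cross_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.box_head = nn.Linear(d, 6)  # center (x, y, z) + size (w, h, d)

    def forward(self, lang_feats, proposal_feats):
        # lang_feats: (B, L, d) utterance token features (top-down stream)
        # proposal_feats: (B, P, d) pre-trained detector proposals (bottom-up stream)
        ctx = torch.cat([lang_feats, proposal_feats], dim=1)
        q = self.queries.unsqueeze(0).expand(lang_feats.size(0), -1, -1)
        dec, _ = self.cross_attn(q, ctx, ctx)
        boxes = self.box_head(dec)                    # decoded, not selected
        spans = dec @ lang_feats.transpose(1, 2)      # (B, Q, L) soft alignment
        return boxes, spans

model = ButdDetrSketch()
lang = torch.randn(2, 12, 64)   # e.g. tokens of "the leg of the chair"
props = torch.randn(2, 20, 64)  # features of 20 detector proposals
boxes, spans = model(lang, props)
```

Because the boxes are regressed by the head, the model can localize entities (e.g. a chair leg) that the proposal pool missed; the `spans` similarity is a stand-in for aligning each decoded box to utterance tokens, which is also how object-label prompts can supervise grounding.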

Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki • 2021

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| 3D Visual Grounding | ScanRefer (val) | Overall Accuracy @ IoU 0.50 | 39.8 | 155 |
| 3D Visual Grounding | Nr3D (test) | Overall Success Rate | 54.6 | 88 |
| 3D Visual Grounding | Nr3D | Overall Success Rate | 54.6 | 74 |
| 3D Visual Grounding | Sr3D (test) | Overall Accuracy | 67 | 73 |
| 3D Visual Grounding | ScanRefer Unique | Accuracy @ IoU 0.25 | 84.2 | 24 |
| 3D Visual Grounding | ScanRefer (test) | Unique Accuracy | 84.2 | 21 |
| 3D Visual Grounding | ScanRefer Multiple (val) | Accuracy @ IoU 0.25 | 46.6 | 15 |
| 3D Visual Grounding | Nr3D (val) | Easy Score | 60.7 | 13 |
| 3D referring expression comprehension | SR3D ReferIt3D (test) | Overall Accuracy | 67 | 11 |
| Referring Expression Segmentation | ScanRefer | mIoU | 35.4 | 9 |

Showing 10 of 18 rows.
