
Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts

About

Textual grounding is an important but challenging task for human-computer interaction, robotics and knowledge mining. Existing algorithms generally formulate the task as selection from a set of bounding box proposals obtained from deep-net-based systems. In this work, we demonstrate that we can cast the problem of textual grounding into a unified framework that permits efficient search over all possible bounding boxes. Hence, the method is able to consider significantly more proposals and does not rely on a successful first stage that hypothesizes bounding box proposals. In addition, we demonstrate that the trained parameters of our model can be used as word embeddings which capture spatial-image relationships and provide interpretability. Lastly, at the time of submission, our approach outperformed the current state-of-the-art methods on the Flickr 30k Entities and the ReferItGame datasets by 3.08% and 7.77%, respectively.
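
To make "search over all possible bounding boxes" concrete, here is a minimal sketch. It assumes a hypothetical per-pixel score map for a query phrase and scores a box by the sum of the values inside it. The abstract does not describe the paper's actual search procedure or energy function, so this brute-force enumeration with a summed-area table is only an illustration of the search space, not the authors' method. The name `best_box` and the toy score map are invented for this example.

```python
def best_box(score_map):
    """Return (score, (top, left, bottom, right)) for the box with the
    largest sum of scores. Coordinates are inclusive pixel indices.

    Brute force over every box; illustrative only, not the paper's search.
    """
    h, w = len(score_map), len(score_map[0])

    # Summed-area table: sat[i][j] = sum of score_map rows < i, columns < j.
    sat = [[0.0] * (w + 1) for _ in range(h + 1)]
    for i in range(h):
        row_sum = 0.0
        for j in range(w):
            row_sum += score_map[i][j]
            sat[i + 1][j + 1] = sat[i][j + 1] + row_sum

    best_score, best = float("-inf"), None
    for top in range(h):
        for bottom in range(top, h):
            for left in range(w):
                for right in range(left, w):
                    # Box sum in O(1) from four table lookups.
                    s = (sat[bottom + 1][right + 1] - sat[top][right + 1]
                         - sat[bottom + 1][left] + sat[top][left])
                    if s > best_score:
                        best_score, best = s, (top, left, bottom, right)
    return best_score, best


# Toy score map (assumed): negative background, one positive 2x3 region.
score_map = [[-1.0] * 8 for _ in range(6)]
for r in (2, 3):
    for c in (3, 4, 5):
        score_map[r][c] = 2.0

print(best_box(score_map))  # (12.0, (2, 3, 3, 5))
```

Enumerating every box costs O(h²w²) evaluations, which is why a proposal-free method needs a more efficient search than this loop for real image sizes.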

Raymond A. Yeh, Jinjun Xiong, Wen-mei W. Hwu, Minh N. Do, Alexander G. Schwing • 2018

Related benchmarks

Task                            Dataset                    Result          Rank
Phrase Localization             Flickr30K Entities (test)  --              35
Visual Grounding                Flickr30K Entities (test)  Accuracy 53.97  29
Phrase grounding                Flickr30K                  Accuracy 53.97  20
Phrase grounding                ReferIt                    Accuracy 34.7   14
Referring Expression Grounding  RefCLEF (test)             Accuracy 34.7   4
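
This page does not say how the Accuracy values are computed. A common convention on grounding benchmarks, assumed here, is to count a predicted box as correct when its intersection-over-union (IoU) with the ground-truth box is at least 0.5. The sketch below illustrates that convention only. The function names and the box format `(x1, y1, x2, y2)` are choices made for this example, and the sample boxes are invented.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Overlap is zero when the boxes do not intersect.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(predicted, ground_truth, threshold=0.5):
    """Percentage of predictions whose IoU with the ground truth
    meets the threshold (0.5 is an assumed convention)."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predicted, ground_truth))
    return 100.0 * hits / len(ground_truth)


# Invented example: the first prediction matches exactly, the second misses.
predicted = [(0, 0, 10, 10), (0, 0, 10, 10)]
ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
print(grounding_accuracy(predicted, ground_truth))  # 50.0
```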
