Auto-Encoding Scene Graphs for Image Captioning

About

We propose the Scene Graph Auto-Encoder (SGAE), which incorporates the language inductive bias into the encoder-decoder image captioning framework to produce more human-like captions. Intuitively, we humans use this inductive bias to compose collocations and make contextual inferences in discourse. For example, when we see the relation 'person on bike', it is natural to replace 'on' with 'ride' and infer 'person riding bike on a road' even if the 'road' is not evident. Exploiting such bias as a language prior is therefore expected to make conventional encoder-decoder models less likely to overfit to dataset bias and more focused on reasoning. Specifically, we use the scene graph, a directed graph ($\mathcal{G}$) in which an object node is connected by adjective nodes and relationship nodes, to represent the complex structural layout of both the image ($\mathcal{I}$) and the sentence ($\mathcal{S}$). In the textual domain, we use SGAE to learn a dictionary ($\mathcal{D}$) that helps to reconstruct sentences in the $\mathcal{S}\rightarrow \mathcal{G} \rightarrow \mathcal{D} \rightarrow \mathcal{S}$ pipeline, where $\mathcal{D}$ encodes the desired language prior; in the vision-language domain, we use the shared $\mathcal{D}$ to guide the encoder-decoder in the $\mathcal{I}\rightarrow \mathcal{G}\rightarrow \mathcal{D} \rightarrow \mathcal{S}$ pipeline. Thanks to the scene graph representation and the shared dictionary, the inductive bias can in principle be transferred across domains. We validate the effectiveness of SGAE on the challenging MS-COCO image captioning benchmark; for example, our SGAE-based single model achieves a new state-of-the-art $127.8$ CIDEr-D on the Karpathy split, and a competitive $125.5$ CIDEr-D (c40) on the official server, even when compared with ensemble models.
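To make the scene graph and the shared dictionary more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the `SceneGraph` class, the edge directions, the 'red' attribute, and the `re_encode` function are all hypothetical. In particular, the abstract does not say how $\mathcal{D}$ is used; the sketch assumes one plausible form, in which a feature vector is re-expressed as a softmax-weighted sum of dictionary entries.

```python
import math
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Toy scene graph: object nodes with attached adjective and relationship nodes."""
    objects: list = field(default_factory=list)
    attributes: list = field(default_factory=list)     # (object, adjective) pairs
    relationships: list = field(default_factory=list)  # (subject, relation, object) triples

    def edges(self):
        """Return directed edges; the directions chosen here are an assumption."""
        out = [(obj, adj) for obj, adj in self.attributes]
        for subj, rel, obj in self.relationships:
            rel_node = f"{subj}-{rel}-{obj}"  # one relationship node per triple
            out += [(subj, rel_node), (rel_node, obj)]
        return out


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def re_encode(x, dictionary):
    """Re-express x as an attention-weighted sum of dictionary rows (assumed form)."""
    weights = softmax([sum(xi * di for xi, di in zip(x, row)) for row in dictionary])
    return [sum(w * row[j] for w, row in zip(weights, dictionary))
            for j in range(len(x))]


# The abstract's example, 'person riding bike on a road', as a graph.
graph = SceneGraph(
    objects=["person", "bike", "road"],
    attributes=[("bike", "red")],  # hypothetical adjective, not from the abstract
    relationships=[("person", "ride", "bike"), ("bike", "on", "road")],
)

for src, dst in graph.edges():
    print(f"{src} -> {dst}")

# A two-entry toy dictionary; the re-encoded vector is pulled toward the
# entry that best matches the input feature.
print(re_encode([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
```

Under this assumed design, the same `dictionary` could be applied to node features from a sentence graph or an image graph, which is one way to read the abstract's claim that a shared $\mathcal{D}$ carries the language prior across domains.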

Xu Yang, Kaihua Tang, Hanwang Zhang, Jianfei Cai • 2018

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Captioning | MS COCO Karpathy (test) | CIDEr 1.291 | 682 |
| Image Captioning | MS-COCO (test) | -- | 117 |
| Image Captioning | MS COCO (Karpathy) | CIDEr-D 127.8 | 56 |
| Image Captioning | MS-COCO online (test) | BLEU-4 (c5) 38.5 | 49 |
| Image Captioning | MS-COCO 2014 (test) | BLEU-4 69.7 | 43 |
| Image Captioning | COCO c5 references online (test) | BLEU-1 81 | 24 |
| Image Captioning | MSCOCO (test server) | BLEU-4 (c5) 38.5 | 22 |
| Image Captioning | MS COCO 40,775 images (test) | CIDEr 126.5 | 15 |
| Image Captioning | COCO (test) | METEOR 28.4 | 9 |
