Semi-Autoregressive Transformer for Image Captioning

About

Current state-of-the-art image captioning models adopt autoregressive decoders, i.e., they generate each word by conditioning on previously generated words, which leads to heavy latency during inference. To tackle this issue, non-autoregressive image captioning models have recently been proposed to significantly accelerate inference by generating all words in parallel. However, these non-autoregressive models inevitably suffer from a large degradation in generation quality because they remove too much of the dependence between words. To make a better trade-off between speed and quality, we introduce a semi-autoregressive model for image captioning (dubbed SATIC), which keeps the autoregressive property globally but generates words in parallel locally. Because SATIC is based on the Transformer, only a few modifications are needed to implement it. Experimental results on the MSCOCO image captioning benchmark show that SATIC can achieve a good trade-off without bells and whistles. Code is available at https://github.com/YuanEZhou/satic.
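
The idea of staying autoregressive globally while decoding in parallel locally can be illustrated with a group-wise attention mask. The following is a minimal sketch, not the authors' implementation: the function names, the `group_size` parameter, and the exact masking rule (a position may attend to every position in its own group and in earlier groups) are assumptions made for illustration only.

```python
import math


def group_causal_mask(length, group_size):
    """Return a length x length boolean mask (hypothetical helper).

    mask[i][j] is True when position i may attend to position j.
    Assumed rule: positions are split into consecutive groups of
    `group_size`; a position sees its own group and all earlier groups.
    """
    return [
        [j // group_size <= i // group_size for j in range(length)]
        for i in range(length)
    ]


def decoding_steps(length, group_size):
    """Number of sequential decoder passes when one group is emitted per pass."""
    return math.ceil(length / group_size)


if __name__ == "__main__":
    # Print the assumed mask for 6 positions in groups of 2.
    for row in group_causal_mask(6, 2):
        print("".join("1" if visible else "0" for visible in row))
    # Compare sequential passes for a 16-word caption at several group sizes.
    for k in (1, 4, 16):
        print(k, decoding_steps(16, k))
```

Under this assumed rule, `group_size=1` reduces to the ordinary causal mask of an autoregressive decoder, while a group size equal to the caption length makes every position visible at once, as in a fully parallel decoder. Intermediate values trade sequential passes for weaker dependence between the words inside each group.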

Yuanen Zhou, Yong Zhang, Zhenzhen Hu, Meng Wang • 2021

Related benchmarks

Task	Dataset	Metric	Result	Rank
Image Captioning	MS COCO Karpathy (test)	CIDEr	1.272	706
Image Captioning	COCO (Karpathy split)	CIDEr	111	74
