[1909.04402] Compositional Generalization in Image Captioning

Image captioning models are usually evaluated on their ability to describe a held-out set of images, not on their ability to generalize to unseen concepts. We study the problem of compositional generalization, which measures how well a model composes unseen combinations of concepts when describing images. State-of-the-art image captioning models show poor generalization performance on this task. We propose a multi-task model to address the poor performance, that combines caption generation and image--sentence ranking, and uses a decoding mechanism that re-ranks the captions according their similarity to the image. This model is substantially better at generalizing to unseen combinations of concepts compared to state-of-the-art captioning models.

2 mentions: @delliott@supernlpblog
Date: 2019/09/11 12:49

Referring Tweets

@supernlpblog Can SOA image captioning models behave compositionally to generalise to unseen combinations of concepts? (Not really!) See details/results of this analysis and one way of fixing this CoNLL 19 pre-print: https://t.co/TcWs66VMWj https://t.co/IHbwLJmkdV
@delliott Compositional Generalization in Image Captioning accepted at #CoNLL2019 with @mitjanikolaus, Mostafa Abdou, @rahul_a_r, and @MatthewRLamm. We show that jointly learning image captioning and image--sentence ranking improves generalization performance https://t.co/QK1JNwM7sr https://t.co/4AI9dATCHy