[2009.05684] AttnGrounder: Talking to Cars with Attention

We propose Attention Grounder (AttnGrounder), a single-stage, end-to-end trainable model for visual grounding. Visual grounding aims to localize a specific object in an image given a natural language text query. Unlike previous methods that use the same text representation for every image region, we use a visual-text attention module that relates each word in the query to every region in the corresponding image, constructing a region-dependent text representation. To further improve the localization ability of our model, we use the same visual-text attention module to generate an attention mask around the referred object. This attention mask is trained as an auxiliary task against a rectangular mask derived from the ground-truth bounding-box coordinates. We evaluate AttnGrounder on the Talk2Car dataset and show an improvement of 3.26% over existing methods.
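The core ideas above — a word-region attention that yields a region-dependent text representation, and a rectangular ground-truth mask for the auxiliary loss — can be illustrated with a minimal NumPy sketch. This is a simplified dot-product-attention stand-in, not the paper's exact module; all function names and the mean-score mask logit are illustrative assumptions.

```python
import numpy as np

def visual_text_attention(regions, words):
    """Toy word-region attention (sketch, not the paper's exact module).

    regions: (R, d) array of visual region features
    words:   (T, d) array of word embeddings
    Returns a region-dependent text representation (R, d) and a
    per-region saliency score (R,) usable as attention-mask logits.
    """
    scores = regions @ words.T                    # (R, T) word-region similarity
    scores_shift = scores - scores.max(axis=1, keepdims=True)
    attn = np.exp(scores_shift)
    attn /= attn.sum(axis=1, keepdims=True)       # softmax over words, per region
    region_text = attn @ words                    # (R, d) region-dependent text rep
    mask_logits = scores.mean(axis=1)             # illustrative per-region saliency
    return region_text, mask_logits

def rectangular_mask(height, width, box):
    """Binary target mask from a ground-truth box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    mask = np.zeros((height, width), dtype=np.float32)
    mask[y1:y2, x1:x2] = 1.0
    return mask
```

In the full model, the mask logits would be reshaped onto the image grid and trained against `rectangular_mask(...)` with a pixel-wise loss, while the region-dependent text representation feeds the box-prediction head.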

Keywords: attention
Date: 2020/09/15 05:21
