The University of Southampton
University of Southampton Institutional Repository

Learning to count objects in natural images for visual question answering

Zhang, Yan, Hare, Jonathon and Prügel-Bennett, Adam (2018) Learning to count objects in natural images for visual question answering. International Conference on Learning Representations, Vancouver Convention Center, Vancouver, Canada. 30 Apr - 03 May 2018. pp. 1-17.

Record type: Conference or Workshop Item (Paper)

Abstract

Visual Question Answering (VQA) models have struggled with counting objects in natural images so far. We identify a fundamental problem due to soft attention in these models as a cause. To circumvent this problem, we propose a neural network component that allows robust counting from object proposals. Experiments on a toy task show the effectiveness of this component and we obtain state-of-the-art accuracy on the number category of the VQA v2 dataset without negatively affecting other categories, even outperforming ensemble models with our single model. On a difficult balanced pair metric, the component gives a substantial improvement in counting over a strong baseline by 6.6%.

Text
learning_to_count_objects_in_natural_images_for_visual_question_answering - Version of Record
Restricted to Repository staff only

More information

Accepted/In Press date: 29 January 2018
e-pub ahead of print date: 19 February 2018
Published date: 30 April 2018
Venue - Dates: International Conference on Learning Representations, Vancouver Convention Center, Vancouver, Canada, 2018-04-30 - 2018-05-03
Keywords: vqa

Identifiers

Local EPrints ID: 418094
URI: http://eprints.soton.ac.uk/id/eprint/418094
PURE UUID: 609cde96-c244-4978-a283-63decb049b91
ORCID for Yan Zhang: orcid.org/0000-0003-3470-3663
ORCID for Jonathon Hare: orcid.org/0000-0003-2921-4283

Catalogue record

Date deposited: 22 Feb 2018 17:30
Last modified: 27 Feb 2021 02:49

