Learning to count objects in natural images for visual question answering
Zhang, Yan, Hare, Jonathon and Prügel-Bennett, Adam (2018) Learning to count objects in natural images for visual question answering. International Conference on Learning Representations, Vancouver Convention Center, Vancouver, Canada, 30 Apr - 03 May 2018, pp. 1-17.
Record type: Conference or Workshop Item (Paper)
Abstract
Visual Question Answering (VQA) models have so far struggled with counting objects in natural images. We identify a fundamental problem caused by the soft attention used in these models. To circumvent this problem, we propose a neural network component that enables robust counting from object proposals. Experiments on a toy task show the effectiveness of this component, and we obtain state-of-the-art accuracy on the number category of the VQA v2 dataset without negatively affecting other categories, even outperforming ensemble models with our single model. On a difficult balanced pair metric, the component gives a substantial 6.6% improvement in counting over a strong baseline.
Text: learning_to_count_objects_in_natural_images_for_visual_question_answering - Version of Record. Restricted to Repository staff only.
More information
Accepted/In Press date: 29 January 2018
e-pub ahead of print date: 19 February 2018
Published date: 30 April 2018
Venue - Dates: International Conference on Learning Representations, Vancouver Convention Center, Vancouver, Canada, 2018-04-30 - 2018-05-03
Keywords: vqa
Identifiers
Local EPrints ID: 418094
URI: http://eprints.soton.ac.uk/id/eprint/418094
PURE UUID: 609cde96-c244-4978-a283-63decb049b91
Catalogue record
Date deposited: 22 Feb 2018 17:30
Last modified: 16 Mar 2024 03:50
Contributors
Author: Yan Zhang
Author: Jonathon Hare
Author: Adam Prügel-Bennett