Foveated convolutions: improving spatial transformer networks by modelling the retina
Harris, Ethan William Albert
6d531059-ebaa-451c-b242-5394f0288266
Niranjan, Mahesan
5cbaeea8-7288-4b55-a89c-c43d212ddd4f
Hare, Jonathon
65ba2cda-eaaf-4767-a325-cd845504e5a9
13 December 2019
Harris, Ethan William Albert, Niranjan, Mahesan and Hare, Jonathon (2019) Foveated convolutions: improving spatial transformer networks by modelling the retina. In Shared Visual Representations in Human and Machine Intelligence: 2019 NeurIPS Workshop. 8 pp.
Record type: Conference or Workshop Item (Paper)
Abstract
Spatial Transformer Networks (STNs) have the potential to dramatically improve performance of convolutional neural networks in a range of tasks. By ‘focusing’ on the salient parts of the input using a differentiable affine transform, a network augmented with an STN should have increased performance, efficiency and interpretability. However, in practice, STNs rarely exhibit these desiderata, instead converging to a seemingly meaningless transformation of the input. We demonstrate and characterise this localisation problem as deriving from the spatial invariance of feature detection layers acting on extracted glimpses. Drawing on the neuroanatomy of the human eye we then motivate a solution: foveated convolutions. These parallel convolutions with a range of strides and dilations introduce specific translational variance into the model. In so doing, the foveated convolution presents an inductive bias, encouraging the subject of interest to be centred in the output of the attention mechanism, giving significantly improved performance.
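The abstract describes foveated convolutions as parallel convolutions with a range of strides and dilations, concatenated so that the centre of the glimpse is processed at high resolution while the periphery sees a wider context. A minimal sketch of this idea in PyTorch is given below; it is not the authors' implementation. The class name `FoveatedConv2d`, the default dilation set, and the dilation-only simplification (the paper also varies strides, which would require resampling the branch outputs to a common size) are all assumptions for illustration.

```python
import torch
import torch.nn as nn


class FoveatedConv2d(nn.Module):
    """Hypothetical sketch of a foveated convolution layer.

    Parallel 3x3 convolutions with increasing dilation are applied to the
    same input and concatenated channel-wise. The dilation-1 branch models
    the high-resolution fovea; larger dilations model the coarser periphery,
    breaking the translational invariance of a single shared convolution.
    """

    def __init__(self, in_channels, out_channels, dilations=(1, 2, 4)):
        super().__init__()
        assert out_channels % len(dilations) == 0
        branch_out = out_channels // len(dilations)
        # For a 3x3 kernel, padding = dilation keeps the spatial size
        # identical across branches, so the outputs can be concatenated.
        self.branches = nn.ModuleList(
            nn.Conv2d(in_channels, branch_out, kernel_size=3,
                      padding=d, dilation=d)
            for d in dilations
        )

    def forward(self, x):
        return torch.cat([branch(x) for branch in self.branches], dim=1)


x = torch.randn(1, 3, 32, 32)          # a batch of extracted glimpses
y = FoveatedConv2d(3, 12)(x)
print(y.shape)                          # torch.Size([1, 12, 32, 32])
```

In an STN pipeline, a layer like this would sit after the glimpse extraction: because each branch responds differently to where the subject falls in the glimpse, the localisation network is encouraged to centre the subject of interest.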
Text: 5_CameraReadySubmission_workshop - Version of Record
More information
Published date: 13 December 2019
Identifiers
Local EPrints ID: 441204
URI: http://eprints.soton.ac.uk/id/eprint/441204
PURE UUID: ffa1bd91-5154-4b62-a169-25679765d959
Catalogue record
Date deposited: 04 Jun 2020 16:31
Last modified: 17 Mar 2024 03:11
Contributors
Author: Ethan William Albert Harris
Author: Mahesan Niranjan
Author: Jonathon Hare