Data matters: Towards a data-centric theory of generalisation
The ability of a learning machine to perform outside the training data is referred to as its generalisation performance. Despite being researched for many years, generalisation remains one of the key unresolved puzzles in machine learning. In this thesis we start building the understanding needed to construct a new framework for reasoning about generalisation. We start with a theoretical perspective but conclude that the field needs to build stronger intuitions before being able to formalise generalisation in a meaningful way. Our theoretical exploration, however, highlights that the data plays a much more central role than previously acknowledged. To better understand how the data can be incorporated in generalisation studies, we explore the practice of modifying images. The modifications we consider are mixed data augmentation, patch-shuffling, and patch-based occlusion. We find that there are a number of incorrect implicit assumptions in the literature regarding the side effects of data modification. These assumptions render some distortion-based approaches to evaluating model attributes incorrect. In the case of modifying data to assess robustness to occlusion, we propose a solution that addresses the side effects. The existence of these incorrect assumptions shows that the field has a poor understanding of data modification. Despite the field’s limited understanding, data distortion has most recently been used to empirically predict generalisation performance. We focus on this practice and claim that data modification has been carelessly used in this case as well. We argue that it is the limited evaluation settings that caused the modification-based predictors to appear successful despite relying on poorly founded intuitions. We end by proposing the backbone for an extensive evaluation of empirical predictors of generalisation.
We believe that such a practical approach to generalisation, when thoroughly designed, has the potential to provide the understanding needed to create a theoretical framework in the future. Our proposed evaluation setting seeks to explore a variety of data-centric scenarios, highlighting the central role played by the data in the generalisation puzzle.
University of Southampton
Marcu, Antonia
5054fd8c-0a18-41a3-a140-1521d9a19573
2022
Prugel-Bennett, Adam
b107a151-1751-4d8b-b8db-2c395ac4e14e
Marcu, Antonia (2022) Data matters: Towards a data-centric theory of generalisation. University of Southampton, Doctoral Thesis, 166pp.
Record type: Thesis (Doctoral)
More information
Published date: 2022
Identifiers
Local EPrints ID: 481319
URI: http://eprints.soton.ac.uk/id/eprint/481319
PURE UUID: 196bfb16-525f-44ec-84bc-041d1b60fe17
Catalogue record
Date deposited: 23 Aug 2023 16:48
Last modified: 16 Mar 2024 23:50
Contributors
Author: Antonia Marcu
Thesis advisor: Adam Prugel-Bennett