Data matters: Towards a data-centric theory of generalisation

The ability of a learning machine to perform well on data outside its training set is referred to as its generalisation performance. Despite many years of research, generalisation remains one of the key unresolved puzzles in machine learning. In this thesis we begin building the understanding needed to construct a new framework for reasoning about generalisation. We start from a theoretical perspective but conclude that the field needs to build stronger intuitions before it can formalise generalisation in a meaningful way. Our theoretical exploration nevertheless highlights that the data plays a much more central role than previously acknowledged. To better understand how the data can be incorporated into generalisation studies, we explore the practice of modifying images. The modifications we consider are mixed data augmentation, patch-shuffling, and patch-based occlusion. We find that the literature makes a number of incorrect implicit assumptions about the side effects of data modification. These assumptions render some distortion-based approaches to evaluating model attributes invalid. For the case of modifying data to assess robustness to occlusion, we propose a solution that addresses the side effects. The existence of these incorrect assumptions attests to the field’s poor understanding of data modification. Despite this limited understanding, data distortion has recently been used to empirically predict generalisation performance. We focus on this practice and claim that data modification has been used carelessly in this case as well. We argue that limited evaluation settings caused the modification-based predictors to appear successful despite relying on poorly founded intuitions. We end by proposing the backbone for an extensive evaluation of empirical predictors of generalisation.
We believe that such a practical approach to generalisation, when thoroughly designed, has the potential to provide the understanding needed to create a theoretical framework in the future. Our proposed evaluation setting seeks to explore a variety of data-centric scenarios, highlighting the central role played by the data in the generalisation puzzle.

University of Southampton

Marcu, Antonia

5054fd8c-0a18-41a3-a140-1521d9a19573

2022

Prugel-Bennett, Adam

b107a151-1751-4d8b-b8db-2c395ac4e14e

Marcu, Antonia (2022) Data matters: Towards a data-centric theory of generalisation. University of Southampton, Doctoral Thesis, 166pp.

Record type: Thesis (Doctoral)

Text

Thesis-a3b - Version of Record

Available under License University of Southampton Thesis Licence.

Text

Final-thesis-submission-Examination-Miss-Antonia-Marcu

Restricted to Repository staff only

More information

Published date: 2022

Identifiers

Local EPrints ID: 481319

URI: http://eprints.soton.ac.uk/id/eprint/481319

PURE UUID: 196bfb16-525f-44ec-84bc-041d1b60fe17

Catalogue record

Date deposited: 23 Aug 2023 16:48

Last modified: 16 Mar 2024 23:50

Contributors

Author: Antonia Marcu

Thesis advisor: Adam Prugel-Bennett
