README File For "Data Supporting the MPhil Thesis: Towards an understanding of generalisation in deep learning: an analysis of the transformation of information in convolutional neural networks"

Dataset DOI: 10.5258/SOTON/D3540

Date that the file was created: June 2025

-------------------
GENERAL INFORMATION
-------------------

ReadMe Author: Dominic Belcher, University of Southampton

Date of data collection: September 2019 - August 2024

This dataset supports the thesis entitled
Towards an understanding of generalisation in deep learning: an analysis of the transformation of information in convolutional neural networks

AWARDED BY: University of Southampton

DATE OF AWARD: 2025

Information about geographic location of data collection: N/A

Related projects/funders:
The research towards this thesis was funded by a University of Southampton Faculty of Physical Science and Engineering Research Studentship

DESCRIPTION OF THE DATA
This dataset contains the results of the simulations detailed in the above thesis. All results are in jsonlines (JSON Lines) format. No specialist software is required to read this data; any software that can parse json or jsonlines data is sufficient.
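
As an illustration of reading jsonlines data, the following is a minimal Python sketch using only the standard library. The sample text and its field values are made up for illustration and are not taken from the dataset; the only assumption about the real files is that each non-empty line is one JSON object.

```python
import json


def parse_jsonl(text):
    """Parse jsonlines text into a list of records, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical two-trial sample; the values are invented for illustration.
sample = (
    '{"model_name": "example_model", "split_idx": 0}\n'
    '{"model_name": "example_model", "split_idx": 1}\n'
)
records = parse_jsonl(sample)
print(len(records))
```

Each parsed record is an ordinary Python dict, so individual fields can be read by key.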


--------------------------
SHARING/ACCESS INFORMATION
-------------------------- 

Licenses/restrictions placed on the data, or limitations of reuse: Licensed under a CC BY licence

Recommended citation for the data:

Links to other publicly accessible locations of the data: N/A

Links/relationships to ancillary or related data sets: N/A


--------------------
DATA & FILE OVERVIEW
--------------------

This dataset contains:

thesis_data.tar.gz : tarball
 - chapter_3.tar.gz : tarball 
   Results of simulations from chapter 3 of thesis
 - chapter_4.tar.gz : tarball
   Results of simulations from chapter 4 of thesis
 - chapter_5.tar.gz : tarball
   Results of simulations from chapter 5 of thesis

Each chapter tarball contains simulation data in batches. Each batch is a jsonlines file, such as
 trials/aa/trial_data.jsonl
Each jsonlines file contains one line per simulation trial, holding all inputs for that trial together with its results.
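
The nested layout described above can be read without unpacking anything to disk. The following is a minimal Python sketch, not a definitive loader. It assumes that the chapter tarballs sit inside thesis_data.tar.gz under names ending in e.g. chapter_3.tar.gz, and that every batch file ends in .jsonl. The function name iter_trials is hypothetical. The chapter tarball is held in memory, which may be unsuitable for very large chapters.

```python
import io
import json
import tarfile


def iter_trials(outer_path, chapter_name):
    """Yield (batch_file_name, trial_record) pairs for one chapter tarball.

    outer_path   -- path to thesis_data.tar.gz
    chapter_name -- e.g. "chapter_3.tar.gz" (matched against the end of the
                    member name, since the exact internal path is an assumption)
    """
    with tarfile.open(outer_path, "r:gz") as outer:
        inner_member = next(
            (m for m in outer.getmembers() if m.name.endswith(chapter_name)),
            None,
        )
        if inner_member is None:
            raise FileNotFoundError(f"{chapter_name} not found in {outer_path}")
        # Read the whole chapter tarball into memory.
        inner_bytes = outer.extractfile(inner_member).read()

    with tarfile.open(fileobj=io.BytesIO(inner_bytes), mode="r:gz") as inner:
        for member in inner.getmembers():
            if member.isfile() and member.name.endswith(".jsonl"):
                for raw_line in inner.extractfile(member):
                    if raw_line.strip():
                        # One line per simulation trial.
                        yield member.name, json.loads(raw_line)


# Example usage (requires the real archive in the working directory):
# for batch, trial in iter_trials("thesis_data.tar.gz", "chapter_3.tar.gz"):
#     print(batch, trial)
```

Yielding the batch file name alongside each record keeps track of which batch a trial came from, since the files are independent.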

Relationship between files, if important for context: Files are independent

Additional related data collected that was not included in the current data package: N/A

If there are multiple versions of the dataset, list the files updated, when and why the update was made: N/A


--------------------------
METHODOLOGICAL INFORMATION
--------------------------

Description of methods used for collection/generation of data: See chapters 3 through 5 of thesis

Methods for processing the data: See chapters 3 through 5 of thesis

Software- or Instrument-specific information needed to interpret the data, including software and hardware version numbers: no specific software required

Standards and calibration information, if appropriate: N/A

Environmental/experimental conditions: All simulations were run on the Iridis 5 cluster, using a mixture of GPU and high-memory nodes

Describe any quality-assurance procedures performed on the data: N/A

People involved with sample collection, processing, analysis and/or submission: N/A


--------------------------
DATA-SPECIFIC INFORMATION
--------------------------

CHAPTER 3 DATA

Number of variables: 4

Number of cases/rows: Approx 2500

Variable list, defining any abbreviations, units of measure, codes or symbols used:
model_name
dataset
split_idx
classifier
   
Missing data codes: N/A

Specialized formats or other abbreviations used: N/A
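
The variables listed above can be used to group trials once the records are loaded. The following is a minimal Python sketch. It assumes that each variable name appears as a top-level key in every trial record, which this file does not state explicitly. The function name count_by and the sample values are hypothetical.

```python
from collections import Counter


def count_by(records, keys=("model_name", "dataset")):
    """Count trial records per combination of the given variables.

    Missing keys are counted under None rather than raising an error.
    """
    return Counter(tuple(record.get(k) for k in keys) for record in records)


# Hypothetical records; the values are invented for illustration.
example_records = [
    {"model_name": "model_a", "dataset": "data_x", "split_idx": 0, "classifier": "c1"},
    {"model_name": "model_a", "dataset": "data_x", "split_idx": 1, "classifier": "c1"},
    {"model_name": "model_b", "dataset": "data_x", "split_idx": 0, "classifier": "c1"},
]
counts = count_by(example_records)
```

A count per combination is a quick way to check that every model and dataset pairing has the expected number of splits.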


--------------------------

CHAPTER 4 DATA

Number of variables: 4

Number of cases/rows: Approx 2500

Variable list, defining any abbreviations, units of measure, codes or symbols used:
model_name
dataset
split_idx
classifier
   
Missing data codes: N/A

Specialized formats or other abbreviations used: N/A

--------------------------

CHAPTER 4 DATA

Number of variables: 4

Number of cases/rows: Approx 560000

Variable list, defining any abbreviations, units of measure, codes or symbols used:
model_name
dataset
split_idx
noise_scale
   
Missing data codes: N/A

Specialized formats or other abbreviations used: N/A


--------------------------

CHAPTER 5 DATA

Number of variables: 4

Number of cases/rows: Approx 2500

Variable list, defining any abbreviations, units of measure, codes or symbols used:
model_name
dataset
split_idx
classifier
   
Missing data codes: N/A

Specialized formats or other abbreviations used: N/A





