READ ME File For 'Development of childhood asthma prediction models using machine learning and data integration'

Dataset DOI: 10.5258/SOTON/D1943

ReadMe Author: Dilini Kothalawala, University of Southampton, orcid.org/0000-0002-5804-0457

This dataset supports the thesis entitled Development of childhood asthma prediction models using machine learning and data integration
AWARDED BY: University of Southampton
DATE OF AWARD: 2021

DESCRIPTION OF THE DATA
The data comprise clinical data from the Isle of Wight Birth Cohort (IOWBC) used in this thesis, including data dictionaries and intermediate datasets relevant to the analyses performed, as well as documentation for ethical approval, patient consent forms and participant information sheets.
Source code for the analyses performed in this thesis is also provided, alongside supplementary results including: full descriptions of the candidate features considered during the development of the genomic risk scores for childhood asthma; full descriptions of the performance measures reported in the IOWBC for all candidate prediction models developed using machine learning approaches; and all final trained childhood asthma prediction models, to support the future application of the models developed in this thesis.

Scripts and Supplementary Results are available to download. Datasets are available 'on request' only to bona fide researchers. Please complete the attached request form and we will seek approval for the request from the IOW Birth Cohort Data Access Committee.

This dataset contains:

Datasets:
- Chapter 4 - Existing model replication
	- IOWBC_existing model_replication_data - data obtained from the IOWBC used in the replication analysis detailed in Chapter 4 of this thesis.

- Chapter 5 - Machine learning model development
	- IOWBC_data - raw and cleaned datasets, including a data dictionary, obtained from the IOWBC to develop the machine learning models described in Chapter 5 of this thesis. 
	- IOWBC_imputed_data - datasets generated following imputation, as described in Chapter 5 of this thesis.
	- IOWBC_training_test_data - training and test datasets generated for the early life (CAPE) and preschool (CAPP) models, as described in Chapter 5 of this thesis.
	- Oversampled intermediate datasets 
		- all intermediate datasets generated following the application of ADAptive SYNthetic (over)sampling during model training for the CAPE and CAPP models detailed in Chapter 5 of this thesis.

- Chapter 6 - Genomic data integration
	- IOWBC_PRS_data - PRS calculated for each individual in the IOWBC, including training and test set allocations, as described in Chapter 6 of this thesis.
	- IOWBC_MRS_data - nMRS and cMRS calculated for each individual in the IOWBC, including training and test set allocations, as described in Chapter 6 of this thesis.
	- IOWBC_CAPE_integrated_data - datasets containing CAPE model features, PRS and MRS data for each individual in the IOWBC, including training and test set allocations, as described in Chapter 6 of this thesis.
	- IOWBC_CAPP_integrated_data - datasets containing CAPP model features, PRS and MRS data for each individual in the IOWBC, including training and test set allocations, as described in Chapter 6 of this thesis.
	- Oversampled intermediate datasets 
		- all intermediate datasets generated following the application of ADAptive SYNthetic (over)sampling during model training for the PRS, nMRS and cMRS-only machine learning models detailed in Chapter 6 of this thesis.
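
Note on the oversampled intermediate datasets listed above: they result from ADAptive SYNthetic sampling, which creates extra minority-class (asthma) records for training. The Python sketch below is only an illustration of the general idea and is not taken from the thesis scripts. It assumes purely numeric features and Euclidean distance, and the function name, parameters and toy coordinates are all hypothetical. Minority samples with more majority-class neighbours receive more synthetic records, each placed on the line between a sample and one of its nearest minority neighbours.

```python
import math
import random


def adasyn(minority, majority, k=3, beta=1.0, seed=0):
    """Simplified ADASYN sketch: return a list of synthetic minority samples.

    minority, majority: lists of equal-length numeric tuples (assumed input format).
    beta: target balance level (1.0 aims for equal class sizes).
    """
    rng = random.Random(seed)
    n_to_generate = int((len(majority) - len(minority)) * beta)
    if n_to_generate <= 0 or len(minority) < 2:
        return []

    # Step 1: for each minority sample, the share of majority-class points
    # among its k nearest neighbours measures how hard it is to learn.
    labelled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    ratios = []
    for i, x in enumerate(minority):
        others = labelled[:i] + labelled[i + 1:]
        nearest = sorted(others, key=lambda item: math.dist(x, item[0]))[:k]
        ratios.append(sum(1 for _, label in nearest if label == 0) / k)

    total = sum(ratios)
    if total == 0:
        # No minority sample borders the majority class: use uniform weights.
        ratios, total = [1.0] * len(minority), float(len(minority))

    # Step 2: allocate synthetic samples in proportion to that difficulty.
    # Rounding means the final count can differ slightly from n_to_generate.
    counts = [round(r / total * n_to_generate) for r in ratios]

    # Step 3: interpolate between each sample and a random minority neighbour.
    synthetic = []
    for i, (x, n_new) in enumerate(zip(minority, counts)):
        peers = sorted(minority[:i] + minority[i + 1:],
                       key=lambda p: math.dist(x, p))[:k]
        for _ in range(n_new):
            z = rng.choice(peers)
            lam = rng.random()
            synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, z)))
    return synthetic


# Toy example with made-up coordinates (not IOWBC data).
majority = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1), (0, 2), (1, 2)]
minority = [(3, 3), (3, 4), (4, 3)]
synthetic = adasyn(minority, majority, k=3)
print(len(synthetic), "synthetic minority samples")
```

Oversampling of this kind is normally applied to the training data only, so that the test data keep the original class balance.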


Supplementary Results:
- Results
	- Details_of_genomic_risk_score_candidate_features.xlsx - descriptions for the genotype and methylation data used to construct the polygenic and methylation risk scores detailed in Chapter 6 of this thesis.
	- Performance_of_all_models_developed_using_machine_learning.xlsx - performance metrics for all models (CAPE, CAPP, PRS, MRS, integrated CAPE and integrated CAPP machine learning models) developed in Chapters 5 and 6 of this thesis.
- Models
	- Trained models developed in this thesis (all final CAPE, CAPP, PRS and MRS individual and integrated models). 
	- Script detailing how the enclosed models can be used to obtain childhood asthma predictions on new datasets. 

Scripts:
- Chapter 4 - Existing model replication 
	- script to generate and compare predictions made by existing asthma prediction models for each individual in the IOWBC.	

- Chapter 5 - Machine learning model development
	- Data cleaning
		- scripts to encode and clean the raw IOWBC data.
	- Feature selection 
		- script to perform feature selection using both recursive feature elimination and Boruta methods.
	- Evaluation of training optimisation techniques 
		- scripts to assess changes in model performance of the best initial prediction model (linear SVM) following the application of imputation and resampling to address missing data and class imbalances.
	- Training dataset optimisation 
		- scripts to apply imputation and resampling to the CAPE and CAPP training datasets.
	- Model development	
		- Training-test set allocations - scripts in this folder generate the training and test datasets following the application of the various optimisation techniques, in the correct format needed for model development. 
		- scripts to develop the models using the 8 different algorithms, across the different training optimisation datasets.
		- script to evaluate performance measures using bootstrapping.
	- Sensitivity analyses
		- SHAP 	
			- scripts to evaluate the interpretability of the CAPE and CAPP models using SHapley Additive exPlanations (SHAP).
			- scripts to develop CAPE and CAPP models restricted to the features considered important by SHAP.
		- Logistic regression models 
			- scripts to develop logistic regression models equivalent to the CAPE and CAPP models. 

- Chapter 6 - Genomic data integration
	- Polygenic risk score
		- script to identify proxy candidate SNPs from LDlink for inclusion in the PRS.
		- script to construct the PRS in PLINK and PRSice. 
		- script to apply resampling to the training data containing the PRS.
		- script to obtain training and test sets to construct machine learning PRS-only models.
	- Methylation risk scores
		- script to perform feature selection using recursive feature elimination to identify CpGs to include in the newborn (nMRS) and childhood (cMRS) methylation risk scores.
		- scripts to calculate the MRSs using five different methods identified from the literature.
		- scripts to apply resampling to the training data containing the nMRS and cMRS.
		- scripts to obtain training and test sets to construct machine learning nMRS/cMRS-only models.
	- Model integration 
		- scripts to integrate the PRS and/or MRSs with the CAPE and CAPP models. 
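
Note on the polygenic risk score scripts listed above: the PRS in this thesis was constructed in PLINK and PRSice. For readers unfamiliar with the concept, the Python sketch below shows the usual additive form of such a score, a weighted sum of effect-allele counts. It is an illustration only and does not reproduce those tools or the thesis scripts. The function name, SNP identifiers and weights are hypothetical, and skipping missing genotypes is an assumption made here for simplicity.

```python
def polygenic_risk_score(dosages, weights):
    """Return (score, n_snps_used) for one individual.

    dosages: dict of SNP id -> effect-allele count (0, 1 or 2), or None if missing.
    weights: dict of SNP id -> effect size, e.g. a log odds ratio from a GWAS.
    """
    score = 0.0
    used = 0
    for snp, beta in weights.items():
        dosage = dosages.get(snp)
        if dosage is None:
            continue  # assumption: skip missing genotypes rather than impute them
        score += beta * dosage
        used += 1
    return score, used


# Made-up SNP identifiers and effect sizes, for illustration only.
weights = {"snp_a": 0.2, "snp_b": -0.1, "snp_c": 0.05}
individual = {"snp_a": 2, "snp_b": 1, "snp_c": None}
score, used = polygenic_risk_score(individual, weights)
print(f"PRS = {score:.2f} from {used} SNPs")
```

Returning the number of SNPs used alongside the score makes it possible to compare or average scores across individuals with different amounts of missing genotype data.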

Ethics and copyright permission documents relevant to this thesis.

Date of data collection: October 2018 - September 2021

Information about geographic location of data collection: 

Licence:
Restricted

Related projects/Funders:
NIHR Southampton Biomedical Research Centre; University of Southampton Presidential Scholarship


Related publications:
Development of childhood asthma prediction models using machine learning approaches - https://doi.org/10.1002/clt2.12076
Kothalawala, D. M., Perunthadambil Kadalayil, L., Weiss, V. B. N., Kyyaly, M. A., Arshad, S., Holloway, J., & Rezwan, F. I. (2020). Prediction models for childhood asthma: a systematic review. Pediatric Allergy and Immunology, 31(6), 616-627. https://doi.org/10.1111/pai.13247

Date that the file was created: November, 2021