READ ME File For 'Data supporting the University of Southampton Doctoral Thesis "Integration of health informatics: ‘big data’ for clinical translation in inflammatory bowel disease"'

Dataset DOI: 10.5258/SOTON/D2655

ReadMe Author: IMOGEN SIAN STAFFORD, University of Southampton
ORCID ID: 0000-0003-1666-1906

This dataset supports the thesis entitled "Integration of health informatics: ‘big data’ for clinical translation in inflammatory bowel disease"
AWARDED BY: University of Southampton
DATE OF AWARD: 2023

DESCRIPTION OF THE DATA

This dataset predominantly contains computational scripts that provide instructions for reproducing the data processing and machine learning models generated within this thesis. The files are described below according to the chapter in which they are used.

Chapter 3:
* Python and shell scripts required to execute the joint calling pipeline and GenePy 1.3 pipeline detailed in Sections 3.3.4 and 3.3.5.
* Accompanying pdf of static GitHub pages providing further instruction and detail for the joint calling pipeline and GenePy 1.3 pipeline detailed in Sections 3.3.4 and 3.3.5.
* Python and shell scripts required to execute the updated joint calling pipeline and GenePy 1.4 pipeline detailed in Section 3.4.
* Accompanying pdf of static GitHub pages providing further instruction and detail for the updated joint calling pipeline and GenePy pipeline detailed in Section 3.4.

Chapter 4:
* R scripts for the supervised machine learning of genomic data and oxidative stress and antioxidant potential assay data.

Chapter 5 and Chapter 6:
* Quality control report for the IBD cohort.
* Remapped list of highly mutable genes, from Fuentes Fajardo et al.
* Gene panels utilised in machine learning: autoimmune disease gene panel, inflammatory bowel disease monogenic genes and GWAS genes panel, stricturing endotype gene panels, and NOD-signalling pathway gene panel. Some of these panels are also used in Chapter 7.
* Python scripts for machine learning classification of inflammatory bowel disease subtype, the Crohn’s disease stricturing endotype, and age of onset.
* Genes identified in a literature review that are associated with the Crohn’s disease stricturing endotype, and their source.

Chapter 7:
* Python script of random survival forest for stratification of Crohn’s disease patients by stricturing endotype.
* Results for the top features selected during Bayes Search and Grid Search trials to monitor feature selection stability.

This dataset contains:

ch3_genomic_data_processing - initial joint calling pipeline
* ALIGN.sh
* CALL.sh
* all_chr.list
* catvars.sh
* combiner.sh
* gtyper.sh
* Filtering.sh
* meanGQ_filter.sh
* Recalibrate.sh
* ch3_initial_joint_calling_pipeline.pdf

ch3_genomic_data_processing - initial GenePy 1.3 pipeline
* cross-annotate-cadd.py
* GenePy_1.3.sh
* generate_final_matrix.py
* make_scores_mat_6.py
* MatrixMaker.sh
* subber.sh
* ch3_GenePy-1.3.pdf

ch3_genomic_data_processing - 2020 updated joint calling pipeline
* preprocess.sh
* caller.sh
* joint_calling.sh
* vqsr.sh
* variant_HardFiltration.sh
* meanGQ_filter.sh
* ch3_joint_call_pipeline_update.pdf

ch3_genomic_data_processing - GenePy 1.4 pipeline
* vep.sh
* vep_x.sh
* genepy_combine_annotations.py
* GENEPY_1.3.sh
* generate_final_matrix.py
* make_scores_mat_6.py
* MatrixMaker.sh
* subber.sh
* GitHub - UoS-HGIG_GenePy-1.4.pdf

ch4_machine_learning_scripts
* ch4_FRAP_machine_learning.Rmd
* ch4_TBARS_machine_learning.Rmd
* ch4_TFT_machine_learning.Rmd

gene_panels_and_lists (used in chapters 5, 6 and 7)
* 20211004_ibd_monogwas.txt
* 20211004_stricturing_genes_inclusive.txt
* 20211005_fuentes_falsepositives.txt
* 20220106_strict_panel_exclusive.txt
* 20220107_keggnodsigpathway.txt
* HTG_seq_AI.txt

ch5_6_scripts_qc_stricturing_genes
* ch5and6_RF_IBDsubtype.py
* ch5and6_RF_strictendotype.py
* ch5and6_RF_ageofonset.py
* ch6_stricturing_genes_review.xlsx
* QC_report_IBD_cohort.xlsx

ch7_supplementary
* ch7_random_survival_forest_stricturing.py
* ch7_gridsearch_bayessearch_trials.xlsx

Licence: CC-BY

Related Funders:
Institute for Life Sciences
National Institute for Health Research Biomedical Research Centre Southampton

Date that the file was created: September, 2023