Integration of health informatics: ‘big data’ for clinical translation in inflammatory bowel disease
University of Southampton
Stafford, Imogen S.
50987dc1-3772-408f-9093-9124f3d6b2cd
September 2023
Ennis, Sarah
7b57f188-9d91-4beb-b217-09856146f1e9
Mossotto, Enrico
a2a572db-3e95-41c6-94f6-f1b019594372
Beattie, Robert M
9a66af0b-f81c-485c-b01d-519403f0038a
Macarthur, Benjamin
2c0476e7-5d3e-4064-81bb-104e8e88bb6b
Stafford, Imogen S. (2023) Integration of health informatics: ‘big data’ for clinical translation in inflammatory bowel disease. University of Southampton, Doctoral Thesis, 396pp.
Record type: Thesis (Doctoral)
Abstract
Inflammatory bowel disease (IBD) is a chronic, complex autoimmune disease characterised by relapsing-remitting gastrointestinal tract inflammation. It is considered to arise from interactions between an individual’s genetic susceptibility, environmental factors, immune dysregulation, and gut microbial dysbiosis. Genetics can make a larger contribution to IBD pathology in some patients, and this is thought to be linked to age of diagnosis, with genetic factors having the largest effects in very young children. There are two main subtypes of IBD: ulcerative colitis (UC) and Crohn’s disease (CD). Within subtypes, there are different disease behaviours and severities. One disease behaviour of particular interest is the stricturing endotype, which causes a narrowing of the gastrointestinal tract that often requires surgery. This thesis first examines oxidative stress in IBD patients using assay data. Here, statistical and machine learning (ML) methods are employed to examine the relationship between the clinical and genomic characteristics of a set of paediatric patients and their measured oxidative stress and antioxidant potential. No results in this work suggested that these assay data could be used as an indicator for these clinical features, or for pathogenic variation in key oxidative stress genes. The predominant focus of this thesis is the use of genomic data and ML to stratify IBD patients. To prepare genomic data for use in ML pipelines, the GenePy algorithm was used. GenePy takes in information regarding zygosity, allele frequency, and predicted deleteriousness for every variant in a gene. The scores for each variant are summed to create an overall gene score, and these gene scores form a per-gene, per-individual matrix. The two clinical problems analysed here were classifying IBD patients according to their subtype, and stratifying CD patients by the presence or absence of a stricturing endotype.
Subtype classification was achieved with an ML random forest classifier. Optimisation of both the input data and the ML algorithm for these classifications was an important aspect of this work. Several gene panels were trialled for these classifications, and an autoimmune gene panel outperformed an IBD gene panel for determining IBD subtype. Stratifying CD patients by their stricturing endotype was subsequently performed with a random survival forest, which combines a random forest with survival analysis methods. This method is better suited to the longitudinal nature of stricturing endotype development. This work demonstrated challenges that arise from the sparsity of genomic data, and required the development of a pipeline that could reduce the sparsity of the features used by the ML algorithm. The patient stratification performed here demonstrated strong evidence for the presence of different genomic variation patterns within IBD subtypes, and within the CD stricturing endotype. With increased dataset sizes, it may be possible to more clearly detect and cluster patients according to their genomic variation. To take full advantage of this knowledge, there is an additional requirement for deep, varied and longitudinal clinical data. Then, genomic data can guide each patient’s clinical pathway, providing individuals with more personalised, lifelong care.
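The gene-scoring step described in the abstract (per-variant scores built from zygosity, allele frequency and predicted deleteriousness, then summed per gene) can be illustrated with a minimal sketch. This is not the GenePy implementation: the per-variant rule below (alternate allele count × deleteriousness × −log10 of allele frequency) is an assumption chosen only to show the shape of the computation, and the function names, variant values and patient identifiers are all hypothetical.

```python
import math
from collections import defaultdict


def variant_score(alt_allele_count, allele_frequency, deleteriousness):
    """Score one variant in one individual.

    Illustrative rule only (an assumption, not GenePy's published formula):
    rarer and more deleterious variants score higher, and carrying two
    alternate alleles doubles the contribution.
    """
    if alt_allele_count == 0:
        return 0.0
    return alt_allele_count * deleteriousness * -math.log10(allele_frequency)


def gene_score_matrix(variants, genotypes):
    """Sum variant scores per gene to get a per-individual, per-gene matrix.

    variants:  list of dicts with keys 'gene', 'allele_frequency',
               'deleteriousness'
    genotypes: dict mapping individual -> list of alternate allele counts
               (0, 1 or 2), aligned with `variants`
    Returns a dict mapping individual -> {gene: summed score}.
    """
    matrix = {}
    for individual, counts in genotypes.items():
        scores = defaultdict(float)
        for variant, count in zip(variants, counts):
            # Adding 0.0 still registers the gene, so every gene gets a column.
            scores[variant["gene"]] += variant_score(
                count, variant["allele_frequency"], variant["deleteriousness"]
            )
        matrix[individual] = dict(scores)
    return matrix


# Hypothetical toy input: three variants across two genes, two individuals.
example_variants = [
    {"gene": "NOD2", "allele_frequency": 0.01, "deleteriousness": 0.9},
    {"gene": "NOD2", "allele_frequency": 0.1, "deleteriousness": 0.5},
    {"gene": "IL10", "allele_frequency": 0.001, "deleteriousness": 1.0},
]
example_genotypes = {
    "patient_a": [1, 0, 0],  # heterozygous for the first NOD2 variant only
    "patient_b": [0, 2, 1],  # homozygous second NOD2 variant, heterozygous IL10
}

example_matrix = gene_score_matrix(example_variants, example_genotypes)
```

Each row of `example_matrix` is one individual and each key one gene, which is the kind of input a classifier such as a random forest could consume. Many entries are zero when an individual carries no variants in a gene, which is consistent with the sparsity challenge the abstract mentions.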
Text
Doctoral Thesis for Imogen Stafford PDFA - Integration_of_health_informatics_big_data_for_clinical_translation_in_IBD
- Version of Record
Text
Final-thesis-submission-Examination-Miss-Imogen-Stafford
Restricted to Repository staff only
More information
Submitted date: June 2023
Published date: September 2023
Keywords:
Machine Learning, Inflammatory bowel disease, Genomics
Identifiers
Local EPrints ID: 482291
URI: http://eprints.soton.ac.uk/id/eprint/482291
PURE UUID: 4f5190ec-2fbd-4faf-8784-09081610d974
Catalogue record
Date deposited: 26 Sep 2023 16:36
Last modified: 18 Mar 2024 03:49
Contributors
Author: Imogen S. Stafford
Thesis advisor: Enrico Mossotto
Thesis advisor: Robert M Beattie