The University of Southampton
University of Southampton Institutional Repository

Integration of health informatics: ‘big data’ for clinical translation in inflammatory bowel disease


Stafford, Imogen S. (2023) Integration of health informatics: ‘big data’ for clinical translation in inflammatory bowel disease. University of Southampton, Doctoral Thesis, 396pp.

Record type: Thesis (Doctoral)

Abstract

Inflammatory bowel disease (IBD) is a chronic, complex autoimmune disease characterised by relapsing-remitting inflammation of the gastrointestinal tract. It is considered to arise from interactions between an individual’s genetic susceptibility, environmental factors, immune dysregulation, and gut microbial dysbiosis. Genetics makes a larger contribution to IBD pathology in some patients than in others; this is thought to be linked to age at diagnosis, with genetic factors having the largest effects in very young children. There are two main subtypes of IBD: ulcerative colitis (UC) and Crohn’s disease (CD). Within each subtype, disease behaviour and severity vary. One disease behaviour of particular interest is the stricturing endotype, which causes a narrowing of the gastrointestinal tract that often requires surgery.

This thesis first examines oxidative stress in IBD patients using assay data. Statistical and machine learning (ML) methods are employed to examine the relationship between the clinical and genomic characteristics of a set of paediatric patients and their measured oxidative stress and antioxidant potential. No results from this work suggested that these assay data could be used as an indicator of these clinical features, or of pathogenic variation in key oxidative stress genes.

The predominant focus of this thesis is the use of genomic data and ML to stratify IBD patients. To prepare genomic data for use in ML pipelines, the GenePy algorithm was used. GenePy takes in information on zygosity, allele frequency, and predicted deleteriousness for every variant in a gene. The scores for the variants are summed to create an overall gene score, and these gene scores form a per-gene, per-individual matrix. The two clinical problems analysed here were classifying IBD patients according to their subtype, and stratifying CD patients by the presence or absence of a stricturing endotype. These classifications were performed with a random forest classifier. Optimisation of both the input data and the ML algorithm for these classifications was an important aspect of this work. Several gene panels were trialled, and an autoimmune gene panel outperformed an IBD gene panel for determining IBD subtype. Stratification of CD patients by stricturing endotype was subsequently performed with a random survival forest, which combines a random forest with survival analysis methods. This method is better suited to the longitudinal nature of stricturing endotype development. This work demonstrated challenges that arise from the sparsity of genomic data, and required the development of a pipeline to reduce the sparsity of the features used by the ML algorithm.

The patient stratification performed here provided strong evidence for the presence of different patterns of genomic variation within IBD subtypes, and within the CD stricturing endotype. With larger datasets, it may be possible to detect and cluster patients more clearly according to their genomic variation. Taking full advantage of this knowledge additionally requires deep, varied and longitudinal clinical data. Genomic data could then guide each patient’s clinical pathway, providing individuals with more personalised, life-long care.
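
The per-gene, per-individual scoring described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis code or the GenePy implementation. It assumes biallelic variants and an illustrative weighting in which each variant contributes its deleteriousness multiplied by the negative log10 of the product of the frequencies of the two alleles carried at that site, which is one way of combining zygosity, allele frequency, and deleteriousness; the published GenePy definition should be consulted for the exact formula. The function names `variant_score` and `gene_score_matrix`, the tuple layout of `calls`, and the gene and patient labels are all hypothetical.

```python
import math
from collections import defaultdict


def variant_score(deleteriousness, alt_freq, alt_allele_count):
    """Score one biallelic variant for one individual (illustrative weighting).

    deleteriousness:   predicted deleteriousness, assumed scaled to [0, 1]
    alt_freq:          population frequency of the alternate allele, in (0, 1)
    alt_allele_count:  0 = homozygous reference, 1 = heterozygous,
                       2 = homozygous alternate (this encodes zygosity)
    """
    if not 0.0 < alt_freq < 1.0:
        raise ValueError("alt_freq must lie strictly between 0 and 1")
    if alt_allele_count not in (0, 1, 2):
        raise ValueError("alt_allele_count must be 0, 1 or 2")
    if alt_allele_count == 0:
        return 0.0  # the individual does not carry the variant
    ref_freq = 1.0 - alt_freq
    # Frequencies of the two alleles this individual carries at the site.
    first = alt_freq
    second = alt_freq if alt_allele_count == 2 else ref_freq
    # Rare alleles (small product) and deleterious variants score higher.
    return -deleteriousness * math.log10(first * second)


def gene_score_matrix(calls):
    """Sum variant scores into a nested dict: matrix[individual][gene].

    calls: iterable of (individual, gene, deleteriousness, alt_freq,
           alt_allele_count) tuples; this layout is an assumption.
    """
    matrix = defaultdict(lambda: defaultdict(float))
    for individual, gene, deleteriousness, alt_freq, alt_allele_count in calls:
        matrix[individual][gene] += variant_score(
            deleteriousness, alt_freq, alt_allele_count
        )
    return {ind: dict(genes) for ind, genes in matrix.items()}


# Made-up example calls; GENE_A and GENE_B are placeholders, not real genes.
calls = [
    ("P1", "GENE_A", 0.5, 0.1, 2),
    ("P1", "GENE_A", 1.0, 0.01, 2),
    ("P1", "GENE_B", 0.8, 0.2, 0),
    ("P2", "GENE_A", 0.5, 0.1, 1),
]
scores = gene_score_matrix(calls)
```

Many entries in such a matrix are zero whenever an individual carries no scored variant in a gene, which gives some intuition for the sparsity problem the abstract mentions.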

Text
Doctoral Thesis for Imogen Stafford PDFA - Integration_of_health_informatics_big_data_for_clinical_translation_in_IBD - Version of Record
Available under License University of Southampton Thesis Licence.
Text
Final-thesis-submission-Examination-Miss-Imogen-Stafford
Restricted to Repository staff only
Available under License University of Southampton Thesis Licence.

More information

Submitted date: June 2023
Published date: September 2023
Keywords: Machine Learning, Inflammatory bowel disease, Genomics

Identifiers

Local EPrints ID: 482291
URI: http://eprints.soton.ac.uk/id/eprint/482291
PURE UUID: 4f5190ec-2fbd-4faf-8784-09081610d974
ORCID for Imogen S. Stafford: orcid.org/0000-0003-1666-1906
ORCID for Sarah Ennis: orcid.org/0000-0003-2648-0869
ORCID for Benjamin Macarthur: orcid.org/0000-0002-5396-9750

Catalogue record

Date deposited: 26 Sep 2023 16:36
Last modified: 18 Mar 2024 03:49

Contributors

Author: Imogen S. Stafford
Thesis advisor: Sarah Ennis
Thesis advisor: Enrico Mossotto
Thesis advisor: Robert M Beattie
Thesis advisor: Benjamin Macarthur
