The University of Southampton
University of Southampton Institutional Repository

Supervised machine learning classifies inflammatory bowel disease patients by subtype using whole exome sequencing data

Stafford, Imogen S., Ashton, James, Mossotto, Enrico, Cheng, Guo, Beattie, R. Mark and Ennis, Sarah (2023) Supervised machine learning classifies inflammatory bowel disease patients by subtype using whole exome sequencing data. Journal of Crohn's and Colitis, 17 (10), 1672-1680. (doi:10.1093/ecco-jcc/jjad084).

Record type: Article

Abstract

Background: Inflammatory bowel disease [IBD] is a chronic inflammatory disorder with two main subtypes: Crohn's disease [CD] and ulcerative colitis [UC]. Prompt subtype diagnosis enables the correct treatment to be administered. Using genomic data, we aimed to assess the ability of machine learning [ML] to classify patients according to IBD subtype.

Methods: Whole exome sequencing [WES] data from paediatric and adult IBD patients were processed using an in-house bioinformatics pipeline. These data were condensed into the per-gene, per-individual genomic burden score, GenePy. Data were split into training and testing datasets [80/20]. Feature selection with a linear support vector classifier, and hyperparameter tuning with Bayesian optimisation, were performed on the training data. The supervised ML method random forest was used to classify patients as CD or UC, using three gene panels: 1] all available genes; 2] autoimmune genes; 3] 'IBD' genes. ML results were assessed on the testing dataset using area under the receiver operating characteristic curve [AUROC], sensitivity, and specificity.

Results: A total of 906 patients were included in the analysis [600 CD, 306 UC]. Training data included 488 patients, balanced according to the minority class of UC. The autoimmune gene panel generated the best performing ML model [AUROC = 0.68], outperforming an IBD gene panel [AUROC = 0.61]. NOD2 was the top gene for discriminating CD and UC, regardless of the gene panel used. Lack of variation in genes with high GenePy scores in CD patients was the best classifier of a diagnosis of UC.

Discussion: We demonstrate promising classification of patients by subtype using random forest and WES data. Focusing on specific subgroups of patients, with larger datasets, may result in better classification.
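The 80/20 split, the balancing of the training data to the minority class, and the test-set metrics named in the Methods can be illustrated with a small plain-Python sketch. It is not the study's code. The function names, the per-class (stratified) split, the floor rounding, the random seed, and the choice of CD as the "positive" class are all assumptions made for illustration, and the toy scores are invented. Feature selection, Bayesian hyperparameter tuning, and the random forest itself are omitted.

```python
import random


def split_and_balance(labels, train_percent=80, seed=0):
    """Stratified split, then undersample the training set to the minority class.

    Returns (train_indices, test_indices). Illustrative only: the abstract does
    not say how the split was stratified, rounded, or seeded.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_by_class, test = {}, []
    for label, indices in by_class.items():
        rng.shuffle(indices)
        cut = len(indices) * train_percent // 100  # floor of the training share
        train_by_class[label] = indices[:cut]
        test.extend(indices[cut:])

    # Undersample every class to the size of the smallest training class.
    n_min = min(len(indices) for indices in train_by_class.values())
    train = [i for indices in train_by_class.values() for i in indices[:n_min]]
    return train, test


def auroc(y_true, scores, positive):
    """AUROC as P(score of a random positive > score of a random negative).

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    pos = [s for y, s in zip(y_true, scores) if y == positive]
    neg = [s for y, s in zip(y_true, scores) if y != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(y_true, y_pred, positive):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)


# Cohort sizes from the abstract: 600 CD and 306 UC.
labels = ["CD"] * 600 + ["UC"] * 306
train_idx, test_idx = split_and_balance(labels)
print(len(train_idx), len(test_idx))

# Toy scores (made up) standing in for a classifier's predicted P(CD).
toy_truth = ["CD", "CD", "UC", "UC"]
toy_scores = [0.9, 0.4, 0.6, 0.2]
toy_pred = ["CD" if s >= 0.5 else "UC" for s in toy_scores]
print(auroc(toy_truth, toy_scores, positive="CD"))
print(sensitivity_specificity(toy_truth, toy_pred, positive="CD"))
```

Under these assumptions, 80% of the 306 UC patients rounds down to 244, and matching 244 CD patients gives a training set of 488, which agrees with the figure in the Results. The size of the resulting test set is a product of this sketch only; the abstract does not report it.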

Text: 10_5_23_untracked_Supervised machine learning classifies inflammatory bowel disease patients by subtype using whole exome sequencing data JCC sub - Accepted Manuscript (450kB)
Text: Supplementary Information Tables and Figures Supervised ML for IBD subtypes - Accepted Manuscript (484kB)

More information

Accepted/In Press date: 12 May 2023
e-pub ahead of print date: 19 May 2023
Published date: 1 October 2023
Additional Information: Funding Information: This study was supported by the Institute for Life Sciences, University of Southampton, and the National Institute for Health Research [NIHR] Southampton Biomedical Research Centre. The views expressed are those of the author[s] and not necessarily those of the NIHR or the Department of Health and Social Care. JJA is funded by an NIHR advanced Fellowship (NIHR302478). Publisher Copyright: © 2023 The Author(s).
Keywords: Inflammatory bowel disease, genomics, machine learning

Identifiers

Local EPrints ID: 477607
URI: http://eprints.soton.ac.uk/id/eprint/477607
ISSN: 1873-9946
PURE UUID: f875ea8a-6f06-41be-b4ab-8a996d54130a
ORCID for Imogen S. Stafford: orcid.org/0000-0003-1666-1906
ORCID for James Ashton: orcid.org/0000-0003-0348-8198
ORCID for Sarah Ennis: orcid.org/0000-0003-2648-0869

Catalogue record

Date deposited: 09 Jun 2023 16:36
Last modified: 15 Aug 2024 01:51

Contributors

Author: Imogen S. Stafford
Author: James Ashton
Author: Enrico Mossotto
Author: Guo Cheng
Author: R. Mark Beattie
Author: Sarah Ennis