Supervised machine learning classifies inflammatory bowel disease patients by subtype using whole exome sequencing data
1672-1680
Stafford, Imogen S.
50987dc1-3772-408f-9093-9124f3d6b2cd
Ashton, James
03369017-99b5-40ae-9a43-14c98516f37d
Mossotto, Enrico
a2a572db-3e95-41c6-94f6-f1b019594372
Cheng, Guo
fdfb3e03-f185-49b1-9c53-05b93bb6c8d0
Beattie, R. Mark
55d81c7b-08c9-4f42-b6d3-245869badb71
Ennis, Sarah
7b57f188-9d91-4beb-b217-09856146f1e9
1 October 2023
Stafford, Imogen S., Ashton, James, Mossotto, Enrico, Cheng, Guo, Beattie, R. Mark and Ennis, Sarah (2023) Supervised machine learning classifies inflammatory bowel disease patients by subtype using whole exome sequencing data. Journal of Crohn's and Colitis, 17 (10), 1672-1680. (doi:10.1093/ecco-jcc/jjad084).
Abstract
Background: Inflammatory bowel disease [IBD] is a chronic inflammatory disorder with two main subtypes: Crohn's disease [CD] and ulcerative colitis [UC]. Prompt subtype diagnosis enables the correct treatment to be administered. We aimed to assess whether machine learning [ML] applied to genomic data can classify patients according to IBD subtype.
Methods: Whole exome sequencing [WES] data from paediatric and adult IBD patients were processed using an in-house bioinformatics pipeline. These data were condensed into the per-gene, per-individual genomic burden score, GenePy. Data were split into training and testing datasets [80/20]. Feature selection with a linear support vector classifier, and hyperparameter tuning with Bayesian optimisation, were performed on the training data. The supervised ML method random forest was used to classify patients as CD or UC, using three gene panels: 1] all available genes; 2] autoimmune genes; 3] 'IBD' genes. ML results were assessed on the testing dataset using area under the receiver operating characteristic curve [AUROC], sensitivity, and specificity.
Results: A total of 906 patients were included in the analysis [600 CD, 306 UC]. The training data included 488 patients, balanced according to the minority class of UC. The autoimmune gene panel generated the best performing ML model [AUROC = 0.68], outperforming the IBD gene panel [AUROC = 0.61]. NOD2 was the top gene for discriminating CD and UC, regardless of the gene panel used. Lack of variation in genes with high GenePy scores in CD patients was the best classifier of a diagnosis of UC.
Discussion: We demonstrate promising classification of patients by subtype using random forest and WES data. Focusing on specific subgroups of patients, with larger datasets, may result in better classification.
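The split, class balancing, and evaluation metrics named in the abstract can be illustrated with a minimal standard-library sketch. It is not the authors' pipeline: the abstract does not say whether the 80/20 split was stratified or how balancing was done, so the stratified split, the random undersampling, the treatment of CD as the positive class, and all function names below are assumptions. Under those assumptions the counts are consistent with the reported 488 training patients (244 UC plus 244 CD).

```python
import random


def stratified_split(labels, train_percent=80, seed=0):
    """Split indices per class so each class contributes train_percent % to training.

    Assumption: the abstract reports an 80/20 split but not whether it was stratified.
    """
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_train = len(idx) * train_percent // 100  # integer arithmetic, rounds down
        train.extend(idx[:n_train])
        test.extend(idx[n_train:])
    return train, test


def undersample_to_minority(indices, labels, seed=0):
    """Randomly keep as many samples of each class as the smallest class has."""
    rng = random.Random(seed)
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    n_min = min(len(members) for members in by_class.values())
    balanced = []
    for cls in sorted(by_class):
        balanced.extend(rng.sample(by_class[cls], n_min))
    return balanced


def auroc(y_true, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, with 1 as the positive class."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Cohort sizes taken from the abstract: 600 CD and 306 UC patients.
labels = ["CD"] * 600 + ["UC"] * 306
train_idx, test_idx = stratified_split(labels)
balanced_train = undersample_to_minority(train_idx, labels)
print(len(train_idx), len(balanced_train), len(test_idx))  # 724 488 182

# Toy scores (made up for illustration; 1 = CD, 0 = UC), thresholded at 0.5.
toy_truth = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.2]
toy_pred = [1 if s >= 0.5 else 0 for s in toy_scores]
print(auroc(toy_truth, toy_scores))                  # 0.75
print(sensitivity_specificity(toy_truth, toy_pred))  # (0.5, 0.5)
```

Balancing is applied only to the training indices, so the held-out test set keeps the original class proportions. The toy scores are invented to exercise the metric functions and do not correspond to any result in the paper.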
Text
10_5_23_untracked_Supervised machine learning classifies inflammatory bowel disease patients by subtype using whole exome sequencing data JCC sub
- Accepted Manuscript
Text
Supplementary Information Tables and Figures Supervised ML for IBD subtypes
- Accepted Manuscript
More information
Accepted/In Press date: 12 May 2023
e-pub ahead of print date: 19 May 2023
Published date: 1 October 2023
Additional Information:
Funding Information:
This study was supported by the Institute for Life Sciences, University of Southampton, and the National Institute for Health Research [NIHR] Southampton Biomedical Research Centre. The views expressed are those of the author[s] and not necessarily those of the NIHR or the Department of Health and Social Care. JJA is funded by an NIHR Advanced Fellowship (NIHR302478).
Publisher Copyright:
© 2023 The Author(s).
Keywords:
Inflammatory bowel disease, genomics, machine learning
Identifiers
Local EPrints ID: 477607
URI: http://eprints.soton.ac.uk/id/eprint/477607
ISSN: 1873-9946
PURE UUID: f875ea8a-6f06-41be-b4ab-8a996d54130a
Catalogue record
Date deposited: 09 Jun 2023 16:36
Last modified: 15 Aug 2024 01:51
Contributors
Author: Imogen S. Stafford
Author: James Ashton
Author: Enrico Mossotto
Author: Guo Cheng
Author: R. Mark Beattie
Author: Sarah Ennis