The University of Southampton
University of Southampton Institutional Repository

Mathematical tools for analysis of genome function, linkage disequilibrium structure and disease gene prediction


Vergara Lope Gracia, Norma (2021) Mathematical tools for analysis of genome function, linkage disequilibrium structure and disease gene prediction. University of Southampton, Doctoral Thesis, 204pp.

Record type: Thesis (Doctoral)

Abstract

Next-generation sequencing (NGS) helps to identify the disease-causing genes underlying monogenic and complex diseases. Concurrently, mathematical tools and statistical methods, including machine learning algorithms, are evolving rapidly; together, these technologies represent the new frontier of research and clinical management on the path toward personalised medicine.

This thesis is divided into three main sections. Firstly, linkage disequilibrium (LD) patterns were examined to understand the combined impact of recombination, natural selection, genetic drift and mutation. LD is the non-random association of alleles at different loci in a given population. To this end, LD patterns were constructed from 454 whole-genome sequences (WGS) from the Wellderly study, based on the Malécot-Morton model (exponential distributions with restricted parameters). The extent of LD was then computed for genic, intergenic, exonic and intronic regions. The main result was that significant differences between the exonic, intronic and intergenic components show that fine-scale LD structure provides important insights into genome function, which cannot be revealed by LD analysis of much lower-resolution array-based genotyping or conventional linkage maps.
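As a minimal illustration of the LD definition above (not the pipeline used in the thesis), the sketch below computes the standard pairwise coefficient D = p_AB - p_A * p_B and the squared correlation r² for two biallelic loci from haplotype counts. It also evaluates a Malécot-type decay curve of the commonly cited form rho(d) = (1 - L) * M * exp(-eps * d) + L. The function names, the example counts and the parameter values are hypothetical and chosen only for illustration.

```python
import math


def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r2) for two biallelic loci from the four haplotype counts."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n                # frequency of the A-B haplotype
    p_A = (n_AB + n_Ab) / n        # frequency of allele A at locus 1
    p_B = (n_AB + n_aB) / n        # frequency of allele B at locus 2
    D = p_AB - p_A * p_B           # deviation from independence
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = D * D / denom if denom > 0 else float("nan")
    return D, r2


def malecot_rho(d, eps, M=1.0, L=0.0):
    """Malecot-type association at distance d: decays from (1-L)*M + L to L."""
    return (1 - L) * M * math.exp(-eps * d) + L


# Hypothetical counts: 100 haplotypes with an excess of A-B and a-b.
D, r2 = ld_stats(40, 10, 10, 40)
print(f"D = {D:.3f}, r2 = {r2:.3f}")

# Hypothetical parameters: association shrinks toward L as distance grows.
for d in (0.0, 0.5, 2.0):
    print(d, round(malecot_rho(d, eps=2.0, M=0.9, L=0.1), 4))
```

With independent loci, D and r² are both zero; the decay parameter `eps` controls how quickly association falls off with distance, which is what makes the extent of LD comparable across region types.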

Secondly, machine learning methodologies were applied to classify genes into four groups: essential genes, Mendelian genes, genes associated with complex disorders, and non-essential–non-disease genes. To this end, the dataset was extracted from published studies of the biological and functional properties of genes. Different supervised machine learning (ML) models were then studied to select the features most relevant for classifying genes. Simultaneously, Bayesian inference in a Gaussian graphical model (BGGM) was carried out to identify the significant features for grouping genes. Once the relevant features had been selected, an unsupervised ML approach was developed to cluster genes into those four groups. In the combined analysis of genomic data, the gradient boosting and random forest models explained more than 50% of the variance, and the BGGM results showed that the connectivity between these gene metrics was 40%. The proposed unsupervised model showed an improvement in classifying genes into the Mendelian group. However, the results suggested that some genes involved in Mendelian disorders overlap with those involved in complex disorders.


Thirdly, a polygenic risk score (PRS) was developed to quantify the cumulative effect of low-penetrance genetic variants on breast cancer (BC), following the hypothesis that the polygenic component has an important impact on BC patients, as do BRCA variants. Genome data from POSH and WTCCC were used to generate the PRS. This score was computed based on surprisal theory. As a result, the relative genome information per individual (RGI) was estimated to understand how unusual a genome is relative to the reference genome. Thus, a person with a higher RGI has a more unusual genome. Likewise, a lower RGI corresponds to having more common alleles, and therefore a less surprising genome. The PRS for women who carry BRCA1/2 mutations or intermediate-risk/common variants supported the hypothesis that BC cases contain a strong inherited polygenic component. Furthermore, carriers of the polygenic component tend to have more significant changes in allele frequencies compared with BRCA1 and BRCA2 variants.
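In information theory, the surprisal of an event with probability p is -log2(p), so rare alleles carry more information than common ones. The sketch below sums allele surprisals over a toy set of carried alleles to show why a genome with rarer alleles scores higher. It is an illustration of the general idea only: the abstract does not give the RGI formula, so this is not the definition used in the thesis, and the function names and allele frequencies are hypothetical.

```python
import math


def surprisal(p):
    """Information content, in bits, of observing an allele of frequency p."""
    if not 0 < p <= 1:
        raise ValueError("allele frequency must be in (0, 1]")
    return -math.log2(p)


def genome_information(carried_allele_freqs):
    """Sum of surprisals over the alleles an individual carries (assumed form)."""
    return sum(surprisal(p) for p in carried_allele_freqs)


# Hypothetical population frequencies of the alleles carried at three sites.
common_genome = [0.5, 0.5, 0.25]
rarer_genome = [0.5, 0.125, 0.25]

print(genome_information(common_genome))  # fewer bits: more common alleles
print(genome_information(rarer_genome))   # more bits: a more unusual genome
```

Summing surprisals treats sites as independent, which is a simplifying assumption here; correlated variants would need LD-aware handling.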

This thesis presents methodological contributions to predictive models based on machine learning techniques and mathematical programming, together with relevant insights into disease mechanisms and potential treatment options.

Text
NAVLG_THESIS_81121 - Version of Record
Available under License University of Southampton Thesis Licence.

More information

Published date: September 2021

Identifiers

Local EPrints ID: 474090
URI: http://eprints.soton.ac.uk/id/eprint/474090
PURE UUID: f26326d9-aba3-4572-8710-7232c5431d78
ORCID for Andrew Collins: orcid.org/0000-0001-7108-0771
ORCID for Reuben Pengelly: orcid.org/0000-0001-7022-645X
ORCID for William Tapper: orcid.org/0000-0002-5896-1889
ORCID for Mahesan Niranjan: orcid.org/0000-0001-7021-140X
ORCID for Benjamin Macarthur: orcid.org/0000-0002-5396-9750

Catalogue record

Date deposited: 13 Feb 2023 17:50
Last modified: 17 Mar 2024 03:33

Contributors

Author: Norma Vergara Lope Gracia
Thesis advisor: Andrew Collins
Thesis advisor: Reuben Pengelly
Thesis advisor: William Tapper
Thesis advisor: Mahesan Niranjan
Thesis advisor: Benjamin Macarthur
