Working across the Omic scales in high throughput data-driven biology
The expression of a gene, encoded as a sequence of nucleotides in the genome of an organism, and its regulation can be observed at various levels: messenger RNA levels, translation levels in the ribosome, the cellular concentration of the resulting protein, mutations and aberrations at the sequence level, epigenetic markers, and regulation by microRNAs. The bulk of research in bioinformatics, however, makes high-throughput measurements at only one of these ‘levels’ (or ‘views’) and looks for their functional implications, such as finding biomarkers of complex diseases. Such analyses ignore relationships that exist across these levels of observation and can lead to misleading results: for example, genes that show similar expression between two cohorts of interest may be differently expressed at the protein level because regulation is the cause of the difference between the cohorts. In this work, I consider approaches for integrative analysis across these levels of gene expression and study examples from classification and regression (survival prediction). For these studies, I use publicly archived data from The Cancer Genome Atlas (TCGA). For classification and regression, I carry out extensive feature selection using the Fisher Ratio and Greedy Forward Feature Selection algorithms, respectively. I quantify the performance of classifiers designed using one view of the data and evaluated on another. I then carry out feature selection in an integrative way and find that integrative analysis of multi-omic data may not perform significantly better than the best single level. Another study uses an in-house dataset measured in a single laboratory. I use this dataset to show the importance of measuring translation rates (monosome and polysome) in studies of transcriptome-proteome correlation.
In this analysis, I binarise the mRNA data to reduce the effect of errors in the numerical precision of the biological experiments, cluster the genes based on their binary values, and use GO analysis to validate the clusters biologically. Within these clusters, I show that including translation rates in protein prediction models may explain the protein level more accurately than total mRNA alone (as conventionally measured). I also show the impact of some sequence-derived features on these models and how that impact varies between clusters. Hence, using the full feature set on all genes in bulk is not an efficient way of making this prediction.
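As an illustration of the Fisher Ratio ranking named in the abstract, the following is a minimal Python sketch and not the thesis's implementation. It assumes a two-class problem with labels 0 and 1, a samples-by-features matrix, and the common definition of the ratio (squared difference of class means divided by the sum of class variances). The function names `fisher_ratio` and `top_k_features` and the small constant `eps` are hypothetical choices made for this example.

```python
import numpy as np


def fisher_ratio(X, y, eps=1e-12):
    """Per-feature Fisher ratio for a two-class problem (labels 0 and 1).

    X: array of shape (n_samples, n_features), e.g. expression values.
    y: array of shape (n_samples,) holding class labels 0 or 1.
    eps: assumed small constant guarding against zero variance.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    class0, class1 = X[y == 0], X[y == 1]
    # Squared distance between the class means, per feature.
    between = (class0.mean(axis=0) - class1.mean(axis=0)) ** 2
    # Sum of the within-class variances, per feature.
    within = class0.var(axis=0) + class1.var(axis=0)
    return between / (within + eps)


def top_k_features(X, y, k):
    """Indices of the k features with the largest Fisher ratio, best first."""
    scores = fisher_ratio(X, y)
    return np.argsort(scores)[::-1][:k]
```

Features ranked this way could then be passed to any classifier; how many to keep is a separate choice that this sketch does not address.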
University of Southampton
Jeyananthan, Pratheeba
April 2020
Niranjan, Mahesan
Jeyananthan, Pratheeba (2020) Working across the Omic scales in high throughput data-driven biology. University of Southampton, Doctoral Thesis, 201pp.
Record type: Thesis (Doctoral)
Text
Final thesis unsigned
Restricted to Repository staff only
More information
Published date: April 2020
Identifiers
Local EPrints ID: 447740
URI: http://eprints.soton.ac.uk/id/eprint/447740
PURE UUID: 58e13bd3-afd3-41e6-b643-447b19f499b0
Catalogue record
Date deposited: 19 Mar 2021 17:31
Last modified: 17 Mar 2024 06:26
Contributors
Author: Pratheeba Jeyananthan
Thesis advisor: Mahesan Niranjan