Working across the Omic scales in high throughput data-driven biology

The expression of a gene, encoded as a sequence of nucleotides in the genome of an organism, and its regulation can be observed at various levels: messenger RNA levels, translation levels in the ribosome, the cellular concentration of the resulting protein, mutations and aberrations at the sequence level, epigenetic markers, and regulation by microRNAs. The bulk of research in bioinformatics, however, makes high-throughput measurements at only one of these ‘levels’ (or ‘views’) and looks for functional implications, such as biomarkers of complex diseases. Such analyses ignore relationships that exist across these levels of observation and can lead to misleading results: for example, genes that show similar mRNA expression in two cohorts of interest may be differently expressed at the protein level when regulation is what distinguishes the cohorts. In this work, I consider approaches for integrative analysis across these levels of gene expression and study examples from classification and regression (survival prediction). For these studies, I use publicly archived data from The Cancer Genome Atlas (TCGA). For classification and regression, I carry out extensive feature selection using the Fisher Ratio and Greedy Forward Feature Selection algorithms, respectively. I quantify the performance of classifiers designed using one view of the data and evaluated on another. I then carry out feature selection in an integrative way and find that integrative analysis of multi-omic data may not perform significantly better than the best single level. Another study uses an in-house dataset measured in a single laboratory. I use this dataset to show the importance of measuring translation rates (monosome and polysome) in studies of transcriptome-proteome correlation.
I binarise the mRNA data in this analysis to reduce the effect of errors in the numerical precision of the biological experiments, cluster the genes based on their binary values, and use Gene Ontology (GO) analysis to validate the clusters biologically. Within these clusters, I show that including translation rates in protein prediction models may explain the protein level more accurately than total mRNA alone (the conventional measurement). I also show the impact of some sequence-derived features on these models and how this impact varies between clusters. Hence, applying the full feature set to all genes in bulk is not an efficient way of making this prediction.
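As an illustration of the Fisher Ratio feature selection mentioned in the abstract, the following is a minimal sketch and not the thesis's implementation. It assumes the common two-class definition, the squared difference of class means divided by the sum of class variances, computed per feature. The function names `fisher_ratio` and `rank_features` and the toy data are hypothetical.

```python
from statistics import mean, pvariance


def fisher_ratio(values, labels):
    """Fisher ratio of a single feature for a two-class problem.

    Assumed definition: (mean_0 - mean_1)^2 / (var_0 + var_1),
    with labels given as 0 or 1.
    """
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    spread = pvariance(a) + pvariance(b)
    if spread == 0:
        # Degenerate case: no within-class variation at all.
        return float("inf") if mean(a) != mean(b) else 0.0
    return (mean(a) - mean(b)) ** 2 / spread


def rank_features(X, labels, top_k):
    """Return the indices of the top_k features ranked by Fisher ratio.

    X is a list of samples; each sample is a list of feature values.
    """
    n_features = len(X[0])
    scores = [
        fisher_ratio([row[j] for row in X], labels)
        for j in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order[:top_k]


# Hypothetical toy data: 6 samples, 3 features (e.g. expression of 3 genes).
# Only feature 0 separates the two cohorts.
X = [
    [1.0, 5.0, 0.2],
    [1.2, 4.0, 0.1],
    [0.8, 6.0, 0.3],
    [3.0, 5.5, 0.2],
    [3.2, 4.5, 0.1],
    [2.8, 5.0, 0.3],
]
labels = [0, 0, 0, 1, 1, 1]

selected = rank_features(X, labels, top_k=1)
```

Because each feature is scored independently, a filter of this kind is cheap to run on thousands of genes, but it cannot account for redundancy between features; a wrapper method such as greedy forward selection trades extra computation for that ability.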

University of Southampton

Jeyananthan, Pratheeba

f4c533ad-d3f4-43c5-ae6e-5f201729b71d

April 2020

Niranjan, Mahesan

5cbaeea8-7288-4b55-a89c-c43d212ddd4f

Jeyananthan, Pratheeba (2020) Working across the Omic scales in high throughput data-driven biology. University of Southampton, Doctoral Thesis, 201pp.

Record type: Thesis (Doctoral)