Modelling at the transcriptome - proteome interface
In high-throughput experimental biology, mRNA expression levels and the corresponding protein abundances are often analysed jointly to examine the relationship between these two omic measurements. While some experiments have shown a good correlation between transcriptome and proteome for some species under particular conditions, such correlations are not universal because of post-transcriptional and post-translational regulation. Bridging the gap between transcriptome and proteome measurements therefore allows us to uncover biological insights into these regulatory mechanisms, which are important for studying the protein generation process and several disease conditions. We develop a data-driven predictor that uses transcriptome-layer properties as proxies for protein abundance, and employ the model in a novel manner to detect post-translationally regulated proteins, hypothesizing that model failures (outlier proteins) occur because post-translational modifications (PTMs) disrupt protein stability. Three outlier detection techniques were employed with our protein abundance predictor to detect post-translationally regulated proteins: (1) a simple linear regression model, which detects outliers by inspecting the scatter plot of predicted against measured protein abundance; (2) the Outlier Rejecting Regression (ORR) model, a novel mathematical formulation that returns a user-specified fraction of the data as outliers by solving a non-convex optimization problem with the Difference of Convex functions Algorithm (DCA); and (3) Quantile Regression (QR), which employs an asymmetric loss to detect only outliers with negative losses, applied here for the first time in the omics domain. Proteins extracted as outliers by these techniques supported our hypothesis on post-translational regulation (PTR), showing high statistical confidence for functional annotations and pathway information.
This data-driven framework can therefore serve biologists as a reliable technique for reducing the amount of laboratory experimentation needed to detect post-translationally regulated proteins.
We also perform a thorough inference analysis on the most commonly used high-throughput microarray and RNA-Seq measurements, using several machine learning inference techniques, to determine whether their high numerical precision provides additional information about a gene beyond a binary representation of its on/off status. We perform this analysis at the transcriptome level and, as an extended experimental setting of our PTR detection framework, at the proteome level. These analyses suggest that binarized mRNA concentrations, measured using high-throughput RNA-Seq and microarray technologies, are sufficient for machine learning inferences comparable in accuracy to those from continuous measurements, not only at the transcriptome level but also at the proteome level, where they can be used to predict protein abundance and to detect post-translationally regulated proteins with high confidence.
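The residual-based idea behind technique (1) and the asymmetric loss behind technique (3) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy data, the single predictor, the function names `fit_line`, `residual_outliers` and `pinball_loss`, and the rule of flagging a fixed fraction of the largest absolute residuals. It is not the thesis's predictor, and it does not implement the ORR formulation or DCA.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = a + b * x for a single predictor."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return a, b


def residual_outliers(x, y, fraction=0.1):
    """Flag the given fraction of points with the largest absolute residuals.

    Returns the sorted indices of the flagged points and all residuals
    (measured minus predicted).
    """
    a, b = fit_line(x, y)
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    k = max(1, round(fraction * len(x)))
    ranked = sorted(range(len(x)), key=lambda i: abs(residuals[i]), reverse=True)
    return sorted(ranked[:k]), residuals


def pinball_loss(residual, tau):
    """Asymmetric quantile loss: weight tau on positive residuals,
    weight (1 - tau) on negative residuals."""
    return tau * residual if residual >= 0 else (tau - 1) * residual


# Hypothetical toy data: protein abundance tracks mRNA level except at index 6,
# where the measured protein is far below what the mRNA level suggests.
mrna = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
protein = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 3.0, 16.2, 17.9, 20.1]

outliers, residuals = residual_outliers(mrna, protein, fraction=0.1)
print(outliers, residuals[outliers[0]])
```

With a quantile level such as `tau = 0.9`, `pinball_loss` penalises a positive residual nine times more heavily than a negative residual of the same size. That asymmetry is what lets a quantile fit treat the two sides of the regression line differently.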
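As a rough illustration of what binarizing expression measurements means, the sketch below maps continuous values to on/off states. The median threshold, the function name `binarize` and the sample values are assumptions chosen for illustration. The abstract does not say how the threshold was set.

```python
from statistics import median


def binarize(expression, threshold=None):
    """Map continuous expression values to 1 (on) or 0 (off).

    If no threshold is given, the median of the values is used; this default
    is an illustrative assumption, not a rule taken from the thesis.
    """
    if threshold is None:
        threshold = median(expression)
    return [1 if value > threshold else 0 for value in expression]


# Hypothetical expression levels for six genes.
levels = [0.2, 5.1, 0.0, 7.3, 3.3, 9.8]
states = binarize(levels)
print(states)
```

The binary states can then be passed to the same inference methods as the continuous values, so that the accuracy of the two representations can be compared.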
Gunawardana, Yawwani P.
November 2015
Niranjan, Mahesan
Gunawardana, Yawwani P.
(2015)
Modelling at the transcriptome - proteome interface.
University of Southampton, Physical Sciences and Engineering, Doctoral Thesis, 211pp.
Record type:
Thesis
(Doctoral)
Text
Yawwani Gunawardana (25813722) -PhD Thesis.pdf
- Other
More information
Published date: November 2015
Organisations:
University of Southampton, Vision, Learning and Control
Identifiers
Local EPrints ID: 386877
URI: http://eprints.soton.ac.uk/id/eprint/386877
PURE UUID: ce932fcc-670e-4edd-8475-483c3df1a283
Catalogue record
Date deposited: 12 Feb 2016 14:31
Last modified: 15 Mar 2024 03:29
Contributors
Author:
Yawwani P. Gunawardana
Thesis advisor:
Mahesan Niranjan