Integrative modelling of protein abundance via sequence information
University of Southampton
Parkes, Gregory, Michael
af295bd0-62d8-4ff5-a325-251c59c7537d
February 2022
Niranjan, Mahesan
5cbaeea8-7288-4b55-a89c-c43d212ddd4f
Parkes, Gregory, Michael (2022) Integrative modelling of protein abundance via sequence information. University of Southampton, Doctoral Thesis, 223pp.
Record type: Thesis (Doctoral)
Abstract
Understanding the complex interactions between the transcriptome and proteome is essential to uncovering cellular mechanisms in both health and disease. The underwhelming correlation between corresponding transcript and protein abundance suggests that regulatory processes tightly govern information flow surrounding transcription, translation and post-translation, particularly in higher-order organisms. Inherent difficulties associated with global proteome measurement make modelling protein abundance via proxies desirable, given the pivotal role that intra-cellular proteins play in cell regulation and function. In this thesis, a protein abundance predictor is developed across the human cell cycle using mRNA and translation abundance, determining that mRNA level alone insufficiently explains the transcriptome-proteome relationship. To expand the feature space, some 30 sequence-derived features (SDFs) were engineered that impact proteins before translation, and we demonstrated in our published works that overestimated outliers of the fitted models (r² = 0.67) are associated with post-translational regulation and degradation. It made sense then to expand on the concept of using sequence-engineered features as generalized predictors of expression: a large dataset was curated covering the entire human transcriptome to derive over 180 new features, spanning from genome to estimated post-translational modifications. SDFs were designed with scale and generality in mind, allowing for their application in a variety of ’omic studies. This newly generated resource was validated by systematically analysing intra-feature correlations and by applying unsupervised learning techniques to mitigate inevitable multicollinearity. Finally, global protein abundance prediction using SDFs was attempted, finding that sequence information alone leads to model scores of r² = 0.45, and that including mRNA abundance adds a further 5% to the variance explained by the model.
Unpacking fitted SDF models using gene ontology analysis revealed a close relationship between SDFs and translation, helping to explain their improved model performance over mRNA level. This data-driven approach helps to isolate proteins of interest by outlier detection, with SDF use biased towards predicting steady-state protein abundance.
Text: THESIS - Version of Record
Text: PTD_Thesis_Parkes-SIGNED (restricted to Repository staff only)
More information
Submitted date: August 2021
Published date: February 2022
Identifiers
Local EPrints ID: 457300
URI: http://eprints.soton.ac.uk/id/eprint/457300
PURE UUID: f6f60585-93c2-4712-a985-10bc962bbd79
Catalogue record
Date deposited: 31 May 2022 16:37
Last modified: 17 Mar 2024 03:11
Contributors
Author: Gregory Michael Parkes
Thesis advisor: Mahesan Niranjan