Integrative modelling of protein abundance via sequence information

Understanding the complex interactions between the transcriptome and proteome is essential to uncovering cellular mechanisms in both health and disease. The underwhelming correlation between corresponding transcript and protein abundances suggests that regulatory processes tightly govern information flow around transcription, translation and post-translation, particularly in higher-order organisms. Inherent difficulties associated with global proteome measurement make modelling protein abundance via proxies desirable, given the pivotal role that intracellular proteins play in cell regulation and function. In this thesis, a protein abundance predictor is developed across the human cell cycle using mRNA and translation abundance, determining that mRNA level alone insufficiently explains the transcriptome-proteome relationship. To expand the feature space, some 30 sequence-derived features (SDFs) were engineered that impact proteins before translation, and we demonstrated in our published works that overestimated outliers of the fitted models (r² = 0.67) are associated with post-translational regulation and degradation. Building on the concept of sequence-engineered features as generalized predictors of expression, a large dataset covering the entire human transcriptome was curated to derive over 180 new features, spanning from the genome to estimated post-translational modifications. SDFs were designed with scale and generality in mind, allowing for their application in a variety of ’omic studies. This newly generated resource was validated by systematically analysing intra-feature correlations and applying unsupervised learning techniques to mitigate inevitable multicollinearity. Finally, global protein abundance prediction using SDFs was attempted, finding that sequence information alone leads to model scores of r² = 0.45, with the inclusion of mRNA abundance adding 5% to the explained model variance.
Unpacking fitted SDF models using gene ontology analysis revealed a close relationship between SDFs and translation, helping to explain their improved model performance over mRNA level. This data-driven approach helps to isolate proteins of interest by outlier detection, with SDF use biased towards predicting steady-state protein abundance.

University of Southampton

Parkes, Gregory, Michael

February 2022

Niranjan, Mahesan

Parkes, Gregory, Michael (2022) Integrative modelling of protein abundance via sequence information. University of Southampton, Doctoral Thesis, 223pp.

Record type: Thesis (Doctoral)