Learning from protein data with protein language models
Ieremie, Ioan (2024) Learning from protein data with protein language models. University of Southampton, Doctoral Thesis, 166pp.
Record type: Thesis (Doctoral)
Abstract
Proteins are complex macromolecules responsible for the vast majority of biological processes. Advances in genome sequencing have resulted in an exponential increase in sequence data, while parallel efforts to understand proteins have led to significant progress in determining their structures and interactions. In this thesis, we explore the capability of protein language models (PLMs) to decode and learn from the complexities of protein data, with a focus on predicting protein interactions and learning general protein embeddings. We begin by using functional annotations from the Gene Ontology to learn functional embeddings for proteins and predict their interactions. This approach improves upon classic semantic similarity functions and provides a degree of interpretability: analyzing the attention heads of the transformer architecture reveals the importance of specific Gene Ontology terms. Our model demonstrates high accuracy in predicting protein-protein interactions across multiple datasets, outperforming existing methods. Next, we shift our focus to learning protein embeddings by investigating sequence conservation at the molecular surface and at interaction interfaces to understand the evolutionary aspects of proteins. We develop a multitask pre-training strategy that learns general protein embeddings from sequences, structures, and interaction interfaces, alleviating some of the computational requirements of large PLMs. This strategy leads to improved performance on various downstream tasks. Additionally, we introduce a methodology for generating protein sequence augmentations using evolutionary information, which further improves the model’s generalization capabilities. Finally, we explore compressing pre-training sequence datasets using reduced amino acid alphabets and assess their utility in unsupervised learning. Our results indicate that while reduced alphabets can efficiently capture meaningful embeddings, they may not always outperform models using the full amino acid alphabet. We then devise a simple method to improve single-sequence structure prediction models for proteins with low-depth multiple sequence alignments by translating protein sequences using reduced amino acid alphabets.
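The evolutionary augmentation strategy is only summarized in the abstract. As a minimal sketch of the general idea, the snippet below substitutes a small fraction of residues with evolutionarily plausible alternatives weighted by BLOSUM62 substitution scores; the weighting scheme, substitution rate, and use of Biopython are illustrative assumptions, not the thesis's exact method.

import random
from Bio.Align import substitution_matrices  # Biopython >= 1.75

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BLOSUM62 = substitution_matrices.load("BLOSUM62")

def augment(sequence, rate=0.1, seed=None):
    # Replace roughly `rate` of residues with evolutionarily plausible
    # alternatives: candidates are weighted by 2**BLOSUM62[a, b], so
    # conservative swaps (e.g. I -> V) dominate disruptive ones.
    rng = random.Random(seed)
    out = []
    for aa in sequence:
        if aa in AMINO_ACIDS and rng.random() < rate:
            candidates = [b for b in AMINO_ACIDS if b != aa]
            weights = [2.0 ** BLOSUM62[aa, b] for b in candidates]
            out.append(rng.choices(candidates, weights=weights, k=1)[0])
        else:
            out.append(aa)  # keep unsampled residues and non-standard symbols
    return "".join(out)

print(augment("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", rate=0.15, seed=0))

Each call yields a slightly perturbed but evolutionarily plausible variant of the input, which can be fed to the model as an additional training view of the same protein.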
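The reduced-alphabet translation step amounts to a fixed many-to-one mapping over residues. Below is a minimal sketch assuming one published 8-letter grouping (after Murphy et al.); the abstract does not state which alphabets the thesis actually uses, so the table is an assumption for illustration.

# One published 8-class grouping by physicochemical similarity
# (Murphy et al., 2000); chosen here only for illustration.
REDUCED_GROUPS = {
    "L": "LVIMC",  # aliphatic + cysteine
    "A": "AG",     # small
    "S": "ST",     # hydroxyl
    "P": "P",      # proline
    "F": "FYW",    # aromatic
    "E": "EDNQ",   # acidic + amide
    "K": "KR",     # basic
    "H": "H",      # histidine
}

# Invert to a per-residue lookup: residue -> group representative.
TO_REDUCED = {aa: rep for rep, members in REDUCED_GROUPS.items() for aa in members}

def reduce_sequence(sequence):
    # Map each residue to its group's letter; anything outside the
    # 20 standard amino acids (X, gaps, ...) passes through unchanged.
    return "".join(TO_REDUCED.get(aa, aa) for aa in sequence)

print(reduce_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
# same length as the input, but only 8 distinct symbols remain

Because the mapping is many-to-one, distinct sequences can collapse to the same reduced string, which is what shrinks the pre-training vocabulary and dataset at the cost of some information.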
Text: archival-thesis (Version of Record)
Text: Final-thesis-submission-Examination-Mr-Ioan-Ieremie (Restricted to Repository staff only)
More information
Published date: 2024
Keywords: protein, protein interactions, language models, machine learning, protein embeddings, structure prediction, protein language models, unsupervised learning
Identifiers
Local EPrints ID: 495177
URI: http://eprints.soton.ac.uk/id/eprint/495177
PURE UUID: 4366fdc3-e9d0-4919-bbe2-c1580f1f17b9
Catalogue record
Date deposited: 31 Oct 2024 17:34
Last modified: 01 Nov 2024 02:44
Contributors
Author: Ioan Ieremie
Thesis advisor: Mahesan Niranjan
Thesis advisor: Rob Ewing