Learning from protein data with protein language models
Ieremie, Ioan (2024) Learning from protein data with protein language models. University of Southampton, Doctoral Thesis, 166pp.
Record type: Thesis (Doctoral)
Abstract
Proteins are complex macromolecules responsible for the vast majority of biological processes. Advances in genome sequencing have resulted in an exponential increase in sequence data, while parallel efforts to understand proteins have led to significant progress in determining their structures and interactions. In this thesis, we explore the capability of protein language models (PLMs) to decode and learn from the complexities of protein data, with a focus on predicting protein interactions and learning general protein embeddings. We begin by using functional annotations from the Gene Ontology to learn functional embeddings for proteins and predict their interactions. This approach improves upon classic semantic similarity functions and provides a degree of interpretability: analyzing the attention heads of the transformer architecture reveals the importance of specific Gene Ontology terms. Our model demonstrates high accuracy in predicting protein-protein interactions across multiple datasets, outperforming existing methods. Next, we shift our focus to learning protein embeddings by investigating sequence conservation at the molecular surface and at interaction interfaces to understand the evolutionary aspects of proteins. We develop a multitask pre-training strategy that learns general protein embeddings from sequences, structures, and interaction interfaces, alleviating some of the computational requirements of large PLMs. This strategy leads to improved performance on various downstream tasks. Additionally, we introduce a methodology for generating protein sequence augmentations using evolutionary information, which further improves the model’s generalization capabilities. Finally, we explore compressing pre-training sequence datasets using reduced amino acid alphabets and assess their utility in unsupervised learning. Our results indicate that while reduced alphabets can efficiently capture meaningful embeddings, they may not always outperform models using the full amino acid alphabet. We then devise a simple method to improve single-sequence structure prediction models for proteins with low-depth multiple sequence alignments by translating protein sequences using reduced amino acid alphabets.
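The evolutionary augmentation strategy is only summarized in the abstract. As a minimal sketch of the general idea, the snippet below substitutes a small fraction of residues with evolutionarily plausible alternatives weighted by BLOSUM62 substitution scores; the weighting scheme, substitution rate, and use of Biopython are illustrative assumptions, not the thesis's exact method.

import random
from Bio.Align import substitution_matrices  # Biopython >= 1.75

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BLOSUM62 = substitution_matrices.load("BLOSUM62")

def augment(sequence, rate=0.1, seed=None):
    # Replace roughly `rate` of residues with evolutionarily plausible
    # alternatives: candidates are weighted by 2**BLOSUM62[a, b], so
    # conservative swaps (e.g. I -> V) dominate disruptive ones.
    rng = random.Random(seed)
    out = []
    for aa in sequence:
        if aa in AMINO_ACIDS and rng.random() < rate:
            candidates = [b for b in AMINO_ACIDS if b != aa]
            weights = [2.0 ** BLOSUM62[aa, b] for b in candidates]
            out.append(rng.choices(candidates, weights=weights, k=1)[0])
        else:
            out.append(aa)  # keep unsampled residues and non-standard symbols
    return "".join(out)

print(augment("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", rate=0.15, seed=0))

Each call yields a slightly perturbed but evolutionarily plausible variant of the input, which can be fed to the model as an additional training view of the same protein.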
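The reduced-alphabet translation step amounts to a fixed many-to-one mapping over residues. Below is a minimal sketch assuming one published 8-letter grouping (after Murphy et al.); the abstract does not state which alphabets the thesis actually uses, so the table is an assumption for illustration.

# One published 8-class grouping by physicochemical similarity
# (Murphy et al., 2000); chosen here only for illustration.
REDUCED_GROUPS = {
    "L": "LVIMC",  # aliphatic + cysteine
    "A": "AG",     # small
    "S": "ST",     # hydroxyl
    "P": "P",      # proline
    "F": "FYW",    # aromatic
    "E": "EDNQ",   # acidic + amide
    "K": "KR",     # basic
    "H": "H",      # histidine
}

# Invert to a per-residue lookup: residue -> group representative.
TO_REDUCED = {aa: rep for rep, members in REDUCED_GROUPS.items() for aa in members}

def reduce_sequence(sequence):
    # Map each residue to its group's letter; anything outside the
    # 20 standard amino acids (X, gaps, ...) passes through unchanged.
    return "".join(TO_REDUCED.get(aa, aa) for aa in sequence)

print(reduce_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
# same length as the input, but only 8 distinct symbols remain

Because the mapping is many-to-one, distinct sequences can collapse to the same reduced string, which is what shrinks the pre-training vocabulary and dataset at the cost of some information.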
Text: archival-thesis (Version of Record)
Text: Final-thesis-submission-Examination-Mr-Ioan-Ieremie (Restricted to Repository staff only)
More information
Published date: 2024
Keywords: protein, protein interactions, language models, machine learning, protein embeddings, structure prediction, protein language models, unsupervised learning
Identifiers
Local EPrints ID: 495177
URI: http://eprints.soton.ac.uk/id/eprint/495177
PURE UUID: 4366fdc3-e9d0-4919-bbe2-c1580f1f17b9
Catalogue record
Date deposited: 31 Oct 2024 17:34
Last modified: 01 Nov 2024 02:44
Contributors
Author: Ioan Ieremie
Thesis advisor: Mahesan Niranjan
Thesis advisor: Rob Ewing