The University of Southampton
University of Southampton Institutional Repository

Learning from protein data with protein language models


Ieremie, Ioan (2024) Learning from protein data with protein language models. University of Southampton, Doctoral Thesis, 166pp.

Record type: Thesis (Doctoral)

Abstract

Proteins are complex macromolecules responsible for the vast majority of biological processes. Advancements in genome sequencing have resulted in an exponential increase in sequence data, while other efforts to understand proteins have led to significant progress in determining their structures and interactions. In this thesis, we explore the capability of protein language models (PLMs) to decode and learn from the complexities of protein data, with a focus on predicting protein interactions and learning general protein embeddings. We begin by using functional annotations from the Gene Ontology to learn functional embeddings for proteins and predict their interactions. This approach improves upon classic semantic similarity functions and provides a degree of interpretability regarding the importance of specific Gene Ontology terms by analyzing the attention heads of the transformer architecture. Our model demonstrated high accuracy in predicting protein-protein interactions across multiple datasets, outperforming existing methods. Next, we shift our focus to learning protein embeddings by investigating sequence conservation at the molecular surface and interaction interfaces to understand the evolutionary aspects of proteins. We develop a multitask pre-training strategy that learns general protein embeddings from sequences, structures, and interaction interfaces, thus alleviating some computational requirements for large PLMs. This strategy leads to improved performance in various downstream tasks. Additionally, we introduce a methodology for generating protein sequence augmentations using evolutionary information, which further improves the model’s generalization capabilities. Finally, we explore compressing pre-training sequence datasets using reduced amino acid alphabets and their utility in unsupervised learning. Our results indicate that while reduced alphabets can efficiently capture meaningful embeddings, they may not always outperform models using the full amino acid alphabet. We then devise a simple method to improve single-sequence structure prediction models for proteins with low-depth multiple-sequence alignments by translating protein sequences using reduced amino acid alphabets.
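
As a rough illustration of what translating a sequence with a reduced amino acid alphabet can look like, the sketch below maps each of the 20 standard residues to a representative letter for its group. The grouping, the names REDUCED_GROUPS, build_translation_table and translate, and the example sequence are all assumptions made for illustration; this record does not state which reduced alphabets or which procedure the thesis uses.

```python
# Hypothetical 7-letter reduced alphabet, grouped loosely by physicochemical
# properties. This grouping is an assumption for illustration and is NOT
# taken from the thesis.
REDUCED_GROUPS = {
    "A": "AVLIMC",  # aliphatic / hydrophobic
    "F": "FWYH",    # aromatic
    "S": "STNQ",    # polar, uncharged
    "K": "KR",      # positively charged
    "D": "DE",      # negatively charged
    "G": "G",       # glycine kept separate
    "P": "P",       # proline kept separate
}


def build_translation_table(groups):
    """Invert {representative: members} into {residue: representative}."""
    mapping = {}
    for representative, members in groups.items():
        for residue in members:
            mapping[residue] = representative
    return mapping


def translate(sequence, mapping, unknown="X"):
    """Rewrite a sequence in the reduced alphabet.

    Residues outside the mapping (for example ambiguity codes such as B or Z)
    are replaced by the `unknown` symbol.
    """
    return "".join(mapping.get(residue, unknown) for residue in sequence.upper())


TABLE = build_translation_table(REDUCED_GROUPS)

if __name__ == "__main__":
    # Made-up example sequence, not a real protein from the thesis.
    print(translate("MKTAYIAKQR", TABLE))
```

Under this assumed grouping, many distinct full-alphabet sequences collapse onto the same reduced sequence, which is the sense in which a reduced alphabet compresses a sequence dataset.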

Text
archival-thesis - Version of Record
Available under License University of Southampton Thesis Licence.

More information

Published date: 2024
Keywords: protein, protein interactions, language models, machine learning, protein embeddings, structure prediction, protein language models, unsupervised learning

Identifiers

Local EPrints ID: 495177
URI: http://eprints.soton.ac.uk/id/eprint/495177
PURE UUID: 4366fdc3-e9d0-4919-bbe2-c1580f1f17b9
ORCID for Mahesan Niranjan: orcid.org/0000-0001-7021-140X
ORCID for Rob Ewing: orcid.org/0000-0001-6510-4001

Catalogue record

Date deposited: 31 Oct 2024 17:34
Last modified: 01 Nov 2024 02:44

Contributors

Author: Ioan Ieremie
Thesis advisor: Mahesan Niranjan
Thesis advisor: Rob Ewing
