University of Southampton Institutional Repository

Protein language models meet reduced amino acid alphabets


Ieremie, Ioan
f7eba675-d7c3-42f9-a1c4-47f51b538acb
Ewing, Rob M.
022c5b04-da20-4e55-8088-44d0dc9935ae
Niranjan, Mahesan
5cbaeea8-7288-4b55-a89c-c43d212ddd4f
Gao, Xin
6890237a-052c-41eb-a3c2-549dcd4b8c3b

Ieremie, Ioan, Ewing, Rob M. and Niranjan, Mahesan, Gao, Xin (ed.) (2024) Protein language models meet reduced amino acid alphabets. Bioinformatics, 40 (2), [btae061]. (doi:10.1093/bioinformatics/btae061).

Record type: Article

Abstract

Motivation: Protein language models (PLMs), which borrow ideas for modelling and inference from natural language processing, have demonstrated the ability to extract meaningful representations in an unsupervised way. This has led to significant performance improvements in several downstream tasks. Clustering amino acids based on their physicochemical properties to achieve reduced alphabets has been of interest in past research, but their application to PLMs or folding models is unexplored.

Results: Here, we investigate the efficacy of PLMs trained on reduced amino acid alphabets in capturing evolutionary information, and we explore how the loss of protein sequence information impacts learned representations and downstream task performance. Our empirical work shows that PLMs trained on the full alphabet and a large number of sequences capture fine details that are lost in alphabet reduction methods. We further show the ability of a structure prediction model (ESMFold) to fold CASP14 protein sequences translated using a reduced alphabet. For 10 proteins out of the 50 targets, reduced alphabets improve structural predictions with LDDT-Cα differences of up to 19%.
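The sequence translation step described in the abstract can be sketched as follows. Note that the five-group hydropathy-style clustering below is a generic illustration, not one of the specific reduced alphabets evaluated in the paper.

```python
# Illustrative sketch of reduced-alphabet translation: each of the 20
# standard amino acids is mapped to one representative letter for its
# group. The grouping here is a generic hydropathy-based example only.
REDUCED_GROUPS = {
    "hydrophobic": "AVLIMFWC",
    "polar": "STNQYG",
    "positive": "KRH",
    "negative": "DE",
    "special": "P",
}

# Map each amino acid to the first letter of its group.
AA_TO_REDUCED = {
    aa: group_residues[0]
    for group_residues in REDUCED_GROUPS.values()
    for aa in group_residues
}

def reduce_sequence(seq: str) -> str:
    """Translate a protein sequence into the reduced alphabet,
    leaving unknown characters (e.g. 'X') unchanged."""
    return "".join(AA_TO_REDUCED.get(aa, aa) for aa in seq.upper())

print(reduce_sequence("MKTAYIAKQR"))  # -> AKSASAAKSK
```

A sequence translated this way can then be fed to a PLM or folding model in place of the original, which is the setup the paper's CASP14 experiments use.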

Text
btae061 - Version of Record
Available under License Creative Commons Attribution.
Download (2MB)

More information

Accepted/In Press date: 30 January 2024
e-pub ahead of print date: 3 February 2024
Published date: 13 February 2024

Identifiers

Local EPrints ID: 495957
URI: http://eprints.soton.ac.uk/id/eprint/495957
ISSN: 1367-4803
PURE UUID: 71c9e159-44bf-4cc6-8cd4-7b76cc9612a6
ORCID for Rob M. Ewing: orcid.org/0000-0001-6510-4001
ORCID for Mahesan Niranjan: orcid.org/0000-0001-7021-140X

Catalogue record

Date deposited: 28 Nov 2024 17:32
Last modified: 30 Nov 2024 02:48

Contributors

Author: Ioan Ieremie
Author: Rob M. Ewing
Author: Mahesan Niranjan
Editor: Xin Gao
