Protein language models meet reduced amino acid alphabets
Ieremie, Ioan, Ewing, Rob M. and Niranjan, Mahesan; Gao, Xin (ed.) (2024) Protein language models meet reduced amino acid alphabets. Bioinformatics, 40 (2), [btae061]. (doi:10.1093/bioinformatics/btae061).
Abstract
Motivation: Protein language models (PLMs), which borrowed ideas for modelling and inference from natural language processing, have demonstrated the ability to extract meaningful representations in an unsupervised way, leading to significant performance improvements on several downstream tasks. Clustering amino acids by their physicochemical properties to obtain reduced alphabets has been of interest in past research, but the application of such alphabets to PLMs or folding models is unexplored.
Results: Here, we investigate how well PLMs trained on reduced amino acid alphabets capture evolutionary information, and we explore how the loss of protein sequence information affects learned representations and downstream task performance. Our empirical work shows that PLMs trained on the full alphabet and a large number of sequences capture fine details that are lost under alphabet reduction. We further show the ability of a structure prediction model (ESMFold) to fold CASP14 protein sequences translated using a reduced alphabet. For 10 of the 50 targets, reduced alphabets improve structural predictions, with LDDT-Cα differences of up to 19%.
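As an illustration of the alphabet-reduction idea described in the abstract, the short Python sketch below translates a protein sequence into a smaller alphabet by grouping residues with similar physicochemical properties. The grouping, group symbols, and function name are hypothetical examples chosen for illustration; they are not the specific reduced alphabets evaluated in the paper.

# Minimal sketch: map each amino acid to a group symbol under a hypothetical
# physicochemical grouping (not the paper's exact alphabets).
REDUCED_GROUPS = {
    "AVLIMC": "h",  # hydrophobic
    "FWY":    "a",  # aromatic
    "KRH":    "p",  # positively charged
    "DE":     "n",  # negatively charged
    "STNQ":   "o",  # polar, uncharged
    "G":      "g",  # glycine kept on its own
    "P":      "r",  # proline kept on its own
}
AA_TO_GROUP = {aa: tag for group, tag in REDUCED_GROUPS.items() for aa in group}

def reduce_sequence(seq: str) -> str:
    """Translate a protein sequence into the reduced alphabet;
    unknown residues (e.g. X) are passed through unchanged."""
    return "".join(AA_TO_GROUP.get(aa, aa) for aa in seq.upper())

print(reduce_sequence("MKTAYIAKQR"))  # -> "hpohahhpop" under this hypothetical grouping

A sequence translated this way can then be fed to a language model or folding model in place of the full 20-letter sequence, which is the kind of input transformation the paper studies.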
Text: btae061 (Version of Record)
More information
Accepted/In Press date: 30 January 2024
e-pub ahead of print date: 3 February 2024
Published date: 13 February 2024
Identifiers
Local EPrints ID: 495957
URI: http://eprints.soton.ac.uk/id/eprint/495957
ISSN: 1367-4803
PURE UUID: 71c9e159-44bf-4cc6-8cd4-7b76cc9612a6
Catalogue record
Date deposited: 28 Nov 2024 17:32
Last modified: 30 Nov 2024 02:48
Contributors
Author: Ioan Ieremie
Author: Rob M. Ewing
Author: Mahesan Niranjan
Editor: Xin Gao