Bag-of-words is competitive with sum-of-embeddings language-inspired representations on protein inference
Papadopoulos, Frixos
173204e8-6930-4a3e-b7b6-1ce9918484ad
Sanchez-Elsner, Tilman
b8799f8d-e2b4-4b37-b77c-f2f0e8e2070d
Niranjan, Mahesan
5cbaeea8-7288-4b55-a89c-c43d212ddd4f
Heinson, Ashley
822775d1-9379-4bde-99c3-3c031c3100fb
6 August 2025
Papadopoulos, Frixos, Sanchez-Elsner, Tilman, Niranjan, Mahesan and Heinson, Ashley (2025) Bag-of-words is competitive with sum-of-embeddings language-inspired representations on protein inference. PLoS ONE, 20 (8), [e0325531]. (doi:10.1371/journal.pone.0325531).
Abstract
Inferring protein function is a fundamental and long-standing problem in biology. Laboratory experiments in this field are often expensive, so large-scale computational protein inference from readily available amino acid sequences is needed to understand in more detail the mechanisms underlying biological processes in living organisms. Recent studies have used mathematical ideas from natural language processing and self-supervised learning to derive features from protein sequence information. In language modelling, representations learnt by self-supervised pre-training have been shown to capture the semantic information of words well for downstream applications. In this study, we tested sequence-based protein representations, learnt by self-supervised pre-training on a large protein database, on multiple protein inference tasks. We show that simple baseline representations in the form of bag-of-words histograms perform better than those based on self-supervised learning on sequence similarity and protein inference tasks. Using feature selection, we show that the top discriminant features help bag-of-words capture important information for data-driven function prediction. These findings could have important implications for self-supervised learning models on protein sequences, and might encourage the consideration of alternative pre-training schemes for learning representations that capture more meaningful biological information from the sequence alone.
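The two kinds of representation compared in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the k-mer length, the normalisation, the example sequence and the randomly generated `toy_embedding` (a stand-in for vectors that would come from self-supervised pre-training) are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmers(sequence, k):
    """Return all overlapping k-mers ("words") of a protein sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def bag_of_words(sequence, k=1):
    """Normalised histogram of k-mer counts over the 20**k possible words."""
    vocab = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = Counter(kmers(sequence, k))
    hist = np.array([counts[w] for w in vocab], dtype=float)
    total = hist.sum()
    return hist / total if total > 0 else hist


def sum_of_embeddings(sequence, embedding, k=1):
    """Sum the embedding vectors of every k-mer found in the sequence."""
    dim = len(next(iter(embedding.values())))
    out = np.zeros(dim)
    for w in kmers(sequence, k):
        if w in embedding:  # words without an embedding are skipped
            out += embedding[w]
    return out


# Assumption: random vectors standing in for pre-trained embeddings.
rng = np.random.default_rng(0)
toy_embedding = {aa: rng.normal(size=8) for aa in AMINO_ACIDS}

seq = "MKTAYIAKQR"  # hypothetical example sequence
bow = bag_of_words(seq)                      # 20-dimensional histogram
soe = sum_of_embeddings(seq, toy_embedding)  # 8-dimensional vector
```

In this sketch the sum of embeddings is a linear projection of the unnormalised count vector, so it cannot hold more information about a sequence than the counts themselves. That is a property of the sketch as written, not a result reported in the record.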
Text
journal.pone.0325531 (1)
- Version of Record
More information
Accepted/In Press date: 14 May 2025
Published date: 6 August 2025
Identifiers
Local EPrints ID: 504658
URI: http://eprints.soton.ac.uk/id/eprint/504658
ISSN: 1932-6203
PURE UUID: 01f23cd3-0510-409e-b4c7-12ecc9c3da9c
Catalogue record
Date deposited: 16 Sep 2025 17:15
Last modified: 17 Sep 2025 02:06
Contributors
Author: Frixos Papadopoulos
Author: Mahesan Niranjan
Author: Ashley Heinson