University of Southampton Institutional Repository

Bag-of-words is competitive with sum-of-embeddings language-inspired representations on protein inference

Papadopoulos, Frixos
173204e8-6930-4a3e-b7b6-1ce9918484ad
Sanchez-Elsner, Tilman
b8799f8d-e2b4-4b37-b77c-f2f0e8e2070d
Niranjan, Mahesan
5cbaeea8-7288-4b55-a89c-c43d212ddd4f
Heinson, Ashley
822775d1-9379-4bde-99c3-3c031c3100fb

Papadopoulos, Frixos, Sanchez-Elsner, Tilman, Niranjan, Mahesan and Heinson, Ashley (2025) Bag-of-words is competitive with sum-of-embeddings language-inspired representations on protein inference. PLoS ONE, 20 (8), [e0325531]. (doi:10.1371/journal.pone.0325531).

Record type: Article

Abstract

Inferring protein function is a fundamental and long-standing problem in biology. Laboratory experiments in this field are often expensive, and therefore large-scale computational protein inference from readily available amino acid sequences is needed to understand in more detail the mechanisms underlying biological processes in living organisms. Recently, studies have utilised mathematical ideas from natural language processing and self-supervised learning to derive features based on protein sequence information. In the area of language modelling, it has been shown that representations learnt via self-supervised pre-training can capture the semantic information of words well for downstream applications. In this study, we tested sequence-based protein representations, learnt via self-supervised pre-training on a large protein database, on multiple protein inference tasks. We show that simple baseline representations in the form of bag-of-words histograms perform better than those based on self-supervised learning on sequence similarity and protein inference tasks. Through feature selection, we show that the top discriminant features help bag-of-words capture important information for data-driven function prediction. These findings could have important implications for self-supervised learning models on protein sequences, and might encourage the consideration of alternative pre-training schemes for learning representations that capture more meaningful biological information from the sequence alone.
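The baseline representation described in the abstract, a bag-of-words histogram over a protein's amino acid sequence, can be sketched as follows. This is an illustrative example only, not the authors' exact pipeline: the k-mer size, vocabulary, and normalisation used in the paper are not specified here, and the standard 20-letter amino acid alphabet is assumed.

```python
from collections import Counter

# Standard 20 amino-acid one-letter codes (assumed alphabet; the paper's
# exact vocabulary and k-mer choice may differ).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def bag_of_words(sequence, k=1):
    """Count k-mer occurrences in a protein sequence and return a
    frequency vector; for k=1 this is a 20-bin amino-acid histogram."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    if k == 1:
        vocab = list(AMINO_ACIDS)
    else:
        # Illustrative only: a real pipeline would fix the k-mer vocabulary
        # globally across the dataset, not per sequence.
        vocab = sorted(counts)
    total = sum(counts[w] for w in vocab) or 1
    return [counts[w] / total for w in vocab]

# Example: a short (hypothetical) peptide sequence.
vec = bag_of_words("MKTAYIAKQR")
```

Such fixed-length vectors can then be fed to any standard classifier for function prediction, in contrast to summing learnt per-residue embeddings.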

Text
journal.pone.0325531 (1) - Version of Record
Available under License Creative Commons Attribution.
Download (3MB)

More information

Accepted/In Press date: 14 May 2025
Published date: 6 August 2025

Identifiers

Local EPrints ID: 504658
URI: http://eprints.soton.ac.uk/id/eprint/504658
ISSN: 1932-6203
PURE UUID: 01f23cd3-0510-409e-b4c7-12ecc9c3da9c
ORCID for Frixos Papadopoulos: orcid.org/0000-0002-4429-1562
ORCID for Tilman Sanchez-Elsner: orcid.org/0000-0003-1915-2410
ORCID for Mahesan Niranjan: orcid.org/0000-0001-7021-140X
ORCID for Ashley Heinson: orcid.org/0000-0001-8695-6203

Catalogue record

Date deposited: 16 Sep 2025 17:15
Last modified: 17 Sep 2025 02:06


Contributors

Author: Frixos Papadopoulos
Author: Tilman Sanchez-Elsner
Author: Mahesan Niranjan
Author: Ashley Heinson



Contact ePrints Soton: eprints@soton.ac.uk

ePrints Soton supports OAI 2.0 with a base URL of http://eprints.soton.ac.uk/cgi/oai2

This repository has been built using EPrints software, developed at the University of Southampton, but available to everyone to use.
