University of Southampton Institutional Repository

Bag-of-words is competitive with sum-of-embeddings language-inspired representations on protein inference

Papadopoulos, Frixos
173204e8-6930-4a3e-b7b6-1ce9918484ad
Sanchez-Elsner, Tilman
b8799f8d-e2b4-4b37-b77c-f2f0e8e2070d
Niranjan, Mahesan
5cbaeea8-7288-4b55-a89c-c43d212ddd4f
Heinson, Ashley
822775d1-9379-4bde-99c3-3c031c3100fb

Papadopoulos, Frixos, Sanchez-Elsner, Tilman, Niranjan, Mahesan and Heinson, Ashley (2025) Bag-of-words is competitive with sum-of-embeddings language-inspired representations on protein inference. PLoS ONE, 20 (8), [e0325531]. (doi:10.1371/journal.pone.0325531).

Record type: Article

Abstract

Inferring protein function is a fundamental and long-standing problem in biology. Laboratory experiments in this field are often expensive, and therefore large-scale computational protein inference from readily available amino acid sequences is needed to understand in more detail the mechanisms underlying biological processes in living organisms. Recently, studies have utilised mathematical ideas from natural language processing and self-supervised learning to derive features based on protein sequence information. In the area of language modelling, it has been shown that representations learnt via self-supervised pre-training can capture the semantic information of words well for downstream applications. In this study, we tested sequence-based protein representations, learnt via self-supervised pre-training on a large protein database, on multiple protein inference tasks. We show that simple baseline representations in the form of bag-of-words histograms perform better than those based on self-supervised learning on sequence similarity and protein inference tasks. Through feature selection, we show that the top discriminant features help bag-of-words capture important information for data-driven function prediction. These findings could have important implications for self-supervised learning models on protein sequences, and might encourage the consideration of alternative pre-training schemes for learning representations that capture more meaningful biological information from the sequence alone.
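The baseline representation described in the abstract, a bag-of-words histogram over a protein's amino acid sequence, can be sketched as follows. This is an illustrative example only, not the authors' exact pipeline: the k-mer size, vocabulary, and normalisation used in the paper are not specified here, and the standard 20-letter amino acid alphabet is assumed.

```python
from collections import Counter

# Standard 20 amino-acid one-letter codes (assumed alphabet; the paper's
# exact vocabulary and k-mer choice may differ).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def bag_of_words(sequence, k=1):
    """Count k-mer occurrences in a protein sequence and return a
    frequency vector; for k=1 this is a 20-bin amino-acid histogram."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    if k == 1:
        vocab = list(AMINO_ACIDS)
    else:
        # Illustrative only: a real pipeline would fix the k-mer vocabulary
        # globally across the dataset, not per sequence.
        vocab = sorted(counts)
    total = sum(counts[w] for w in vocab) or 1
    return [counts[w] / total for w in vocab]

# Example: a short (hypothetical) peptide sequence.
vec = bag_of_words("MKTAYIAKQR")
```

Such fixed-length vectors can then be fed to any standard classifier for function prediction, in contrast to summing learnt per-residue embeddings.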

Text
journal.pone.0325531 (1) - Version of Record
Available under License Creative Commons Attribution.
Download (3MB)

More information

Accepted/In Press date: 14 May 2025
Published date: 6 August 2025

Identifiers

Local EPrints ID: 504658
URI: http://eprints.soton.ac.uk/id/eprint/504658
ISSN: 1932-6203
PURE UUID: 01f23cd3-0510-409e-b4c7-12ecc9c3da9c
ORCID for Frixos Papadopoulos: orcid.org/0000-0002-4429-1562
ORCID for Tilman Sanchez-Elsner: orcid.org/0000-0003-1915-2410
ORCID for Mahesan Niranjan: orcid.org/0000-0001-7021-140X
ORCID for Ashley Heinson: orcid.org/0000-0001-8695-6203

Catalogue record

Date deposited: 16 Sep 2025 17:15
Last modified: 17 Sep 2025 02:06


Contributors

Author: Frixos Papadopoulos
Author: Tilman Sanchez-Elsner
Author: Mahesan Niranjan
Author: Ashley Heinson



Contact ePrints Soton: eprints@soton.ac.uk

ePrints Soton supports OAI 2.0 with a base URL of http://eprints.soton.ac.uk/cgi/oai2

This repository has been built using EPrints software, developed at the University of Southampton, but available to everyone to use.
