The University of Southampton
University of Southampton Institutional Repository

Estimation and efficient computation of the true probability of recurrence of short linear protein sequence motifs in unrelated proteins.

Background: large datasets of protein interactions provide a rich resource for the discovery of Short Linear Motifs (SLiMs) that recur in unrelated proteins. However, existing methods for estimating the probability of motif recurrence may be biased by the size and composition of the search dataset, such that p-value estimates from different datasets, or from motifs containing different numbers of non-wildcard positions, are not strictly comparable.

Here, we develop more exact methods and explore the potential biases of computationally efficient approximations.

Results: a widely used heuristic for the calculation of motif over-representation approximates motif probability by assuming that all proteins have the same length and composition. We introduce pv, which calculates the probability exactly.
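To illustrate why the equal-length, equal-composition assumption can matter, the following minimal sketch compares two ways of computing the probability that a motif occurs in at least k of a set of proteins. It is not the paper's definition of pv: the function names, the input format (one precomputed occurrence probability per protein) and the example values are all assumptions made for illustration. The pooled version treats every protein as having the same mean probability (a binomial tail), while the per-protein version keeps each protein's own probability (a Poisson-binomial tail).

```python
from math import comb


def pooled_tail(probs, k):
    """P(at least k proteins contain the motif) if every protein is
    assigned the same mean probability (binomial approximation)."""
    n = len(probs)
    p = sum(probs) / n
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def per_protein_tail(probs, k):
    """P(at least k proteins contain the motif) when each protein keeps
    its own occurrence probability (Poisson-binomial, by dynamic programming)."""
    dist = [1.0]  # dist[j] = probability of exactly j hits so far
    for p in probs:
        updated = [0.0] * (len(dist) + 1)
        for j, mass in enumerate(dist):
            updated[j] += mass * (1 - p)   # motif absent from this protein
            updated[j + 1] += mass * p     # motif present in this protein
        dist = updated
    return sum(dist[k:])


# Hypothetical per-protein probabilities: two short proteins and one long one.
example_probs = [0.01, 0.01, 0.5]
pooled = pooled_tail(example_probs, 2)
exact = per_protein_tail(example_probs, 2)
```

When all per-protein probabilities are equal the two functions agree; with heterogeneous proteins, as in the hypothetical example, the pooled value can differ substantially from the per-protein value.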

Secondly, the recently introduced SLiMFinder statistic Sig accounts for multiple testing (across all possible motifs) in motif discovery. However, it approximates the probability of all other possible motifs, occurring with a score of p or less, as being equal to p.

Here, we show that the exhaustive calculation of the probability of all possible motif occurrences that are as rare or rarer than the motif of interest, Sig', may be carried out efficiently by grouping motifs of a common probability (i.e. those which have permuted orders of the same residues).
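The grouping idea can be sketched as follows, under the simplifying assumption (ours, not necessarily the paper's) that a motif's probability is the product of independent background residue frequencies, so that every ordering of the same residues has the same probability. The function names, the three-letter alphabet and the threshold are hypothetical. Instead of enumerating every motif, the sketch visits one representative per residue multiset and weights it by its number of distinct orderings; a brute-force enumeration is included only as a check.

```python
from collections import Counter
from itertools import combinations_with_replacement, product
from math import factorial, prod


def ordering_count(residues):
    """Number of distinct orderings of a multiset of residues."""
    count = factorial(len(residues))
    for multiplicity in Counter(residues).values():
        count //= factorial(multiplicity)
    return count


def grouped_tail(freqs, k, threshold):
    """Count and total probability of k-residue motifs whose probability
    is <= threshold, visiting one representative per residue multiset."""
    n_motifs, mass = 0, 0.0
    for combo in combinations_with_replacement(sorted(freqs), k):
        p = prod(freqs[r] for r in combo)
        if p <= threshold:
            orderings = ordering_count(combo)
            n_motifs += orderings
            mass += orderings * p
    return n_motifs, mass


def brute_force_tail(freqs, k, threshold):
    """Same quantities by enumerating every ordered motif (check only)."""
    n_motifs, mass = 0, 0.0
    for motif in product(sorted(freqs), repeat=k):
        p = prod(freqs[r] for r in motif)
        if p <= threshold:
            n_motifs += 1
            mass += p
    return n_motifs, mass


# Hypothetical three-residue background, chosen to keep the check small.
background = {"A": 0.5, "C": 0.3, "W": 0.2}
grouped = grouped_tail(background, 3, 0.025)
brute = brute_force_tail(background, 3, 0.025)
```

With the 20 standard amino acids and three defined positions, this visits 1,540 residue multisets rather than 8,000 ordered motifs, and the saving grows with the number of defined positions.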

Sig'v, which corrects both approximations, is shown to be uniformly distributed in a random dataset when searching for non-ambiguous motifs, indicating that it is a robust significance measure.

Conclusions: a method is presented to compute exactly the true probability of a non-ambiguous short protein sequence motif, and the utility of an approximate approach for novel motif discovery across a large number of datasets is demonstrated.

Davey, Norman E., Edwards, Richard J. and Shields, Denis C. (2010) Estimation and efficient computation of the true probability of recurrence of short linear protein sequence motifs in unrelated proteins. BMC Bioinformatics, 11 (14), 14. (doi:10.1186/1471-2105-11-14).

Record type: Article

More information

Published date: 7 January 2010

Identifiers

Local EPrints ID: 142439
URI: http://eprints.soton.ac.uk/id/eprint/142439
ISSN: 1471-2105
PURE UUID: 4e791f44-281a-43ba-a778-1e765c91abbb

Catalogue record

Date deposited: 01 Apr 2010 10:18
Last modified: 14 Mar 2024 00:39

Contributors

Author: Norman E. Davey
Author: Richard J. Edwards
Author: Denis C. Shields
