Mining protein database using machine learning techniques

With a large amount of protein-related information accumulating in widely available online databases, it is of interest to apply machine learning techniques that extract underlying statistical regularities in the data and use them to predict the functional and evolutionary characteristics of unseen proteins. Such predictions can reduce the space over which experiment designers need to search in order to improve our understanding of the biochemical properties of proteins. It has previously been suggested that an artificial neural network can integrate features computed by comparing a pair of proteins, and hence predict the degree to which the two may be evolutionarily related and homologous. We compiled two datasets of protein pairs, each pair being characterised by seven distinct features. We performed an exhaustive search through all possible combinations of features. For the problem of separating remote homologous from analogous pairs, we note that a significant performance gain was obtained by including sequence and structure information. We find that a linear classifier was sufficient to discriminate protein pairs at the family level. At the superfamily level, however, detecting remote homologous pairs was a relatively harder problem, and we find that nonlinear classifiers achieve significantly higher accuracies. In this paper, we compare three different pattern classification methods on two problems formulated as detecting evolutionary and functional relationships between pairs of proteins and, from extensive cross-validation and feature-selection studies, quantify the average limits and uncertainties with which such predictions may be made. Feature selection points to a "knowledge gap" in currently available functional annotations. We demonstrate how the scheme may be employed in a framework to associate an individual protein with an existing family of evolutionarily related proteins.
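The exhaustive search over feature combinations described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the data are synthetic (random Gaussian features, with features 0 and 1 made artificially informative), a nearest-centroid rule stands in for the unspecified linear classifier, and the helper names (`make_synthetic_pairs`, `cross_validated_accuracy`, `exhaustive_feature_search`) are hypothetical. Only the count of seven features per protein pair comes from the abstract.

```python
import itertools
import random


def nearest_centroid_accuracy(train, test, features):
    """Fit a nearest-centroid rule (a simple linear classifier) and score it."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, y in train if y == label]
        centroids[label] = [sum(r[f] for r in rows) / len(rows) for f in features]
    correct = 0
    for x, y in test:
        # Squared Euclidean distance to each class centroid, on the chosen features only.
        dists = {
            label: sum((x[f] - c) ** 2 for f, c in zip(features, centre))
            for label, centre in centroids.items()
        }
        correct += min(dists, key=dists.get) == y
    return correct / len(test)


def cross_validated_accuracy(data, features, k=5):
    """Mean accuracy over k folds for one feature subset."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        scores.append(nearest_centroid_accuracy(train, folds[i], features))
    return sum(scores) / k


def exhaustive_feature_search(data, n_features, k=5):
    """Score every non-empty subset of features: 2**n_features - 1 subsets."""
    results = {}
    for size in range(1, n_features + 1):
        for subset in itertools.combinations(range(n_features), size):
            results[subset] = cross_validated_accuracy(data, subset, k)
    return results


def make_synthetic_pairs(n=200, n_features=7, seed=0):
    """Synthetic stand-in for protein-pair data (assumption, not the paper's data)."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        x = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        x[0] += 2.0 * label  # hypothetical strongly informative feature
        x[1] += 1.0 * label  # hypothetical weakly informative feature
        data.append((x, label))
    rng.shuffle(data)
    return data


data = make_synthetic_pairs()
results = exhaustive_feature_search(data, n_features=7)
best_subset = max(results, key=results.get)
print(len(results), best_subset, round(results[best_subset], 3))
```

With seven features the search covers 127 non-empty subsets, which is small enough to evaluate exhaustively; a nonlinear classifier could be swapped in for `nearest_centroid_accuracy` to compare the two regimes the abstract contrasts.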

10.1515/jib-2008-106

1-10

Camargo, Renata

518ba514-12cc-4e72-863e-a00d864d1dfc

Niranjan, Mahesan

5cbaeea8-7288-4b55-a89c-c43d212ddd4f

1 June 2008

Camargo, Renata and Niranjan, Mahesan (2008) Mining protein database using machine learning techniques. Journal of Integrative Bioinformatics, 5 (2), 1-10. (doi:10.1515/jib-2008-106).