Data Mining and Decision Support in Pharmaceutical Databases

Pasupa, Kitsuchart (2007) Data Mining and Decision Support in Pharmaceutical Databases. University of Sheffield, Automatic Control & Systems Engineering, Doctoral Thesis.

Record type: Thesis (Doctoral)

Abstract

This thesis lies in the area of chemoinformatics, known as virtual screening (VS). VS describes a set of computational methods that provide a fast and cheap alternative to biological screening which involves the selection, synthesis and testing of molecules to ascertain their biological activity in a particular domain, e.g. pain relief, reduction of inflammation. This is important because reducing the cost and, crucially, time in the early stages of compound development can have a disproportionate benefit in profitability in a cycle that has a short patent lifetime. Machine learning methods are becoming popular in this domain but problems arise when 2D fingerprints are used as descriptors. Fingerprints are an extremely sparse, binary-valued representation of molecules. Furthermore, VS also suffers strongly from the so-called "small-sample-size" problem where the number of covariates is comparable to or exceeds the number of samples. These problems can be solved by developing machine learning algorithm which can handle very large sets of high-dimensional data. The high-dimensional data contains an unprecedented level of complexity, hence, some forms of complexity control are therefore necessary. Alternatively a suitable dimensional reduction method can be used. This thesis consists of four major works which are conducted with the MDL Drug Data Report (MDDR) database. The works are as follows: (i) Development of binary kernel discrimination (BKD). (ii) A new algorithm is introduced for kernel machine family, the so-call "parsimonious kernel fisher discrimination". The proposed algorithm is then applied to VS tasks. (iii) Prediction by posterior estimation in VS. (iv) A comparison of four variants of principal component analysis with potential in VS. The experiments show that, BKD in conjunction with Jaccard/Tanimoto is found to be the best method while other approaches are found to be less accurate than BKD but still comparable in a number of cases.

This record has no associated files available for download.