Prediction of properties from simulations: a re-examination with modern statistical methods
1791-1803
Mansson, R.A.
a05aed4e-b47b-46d6-a784-1c4077466081
Frey, J.G.
ba60c559-c4af-44f1-87e6-ce69819bf23f
Essex, J.W.
1f409cfe-6ba4-42e2-a0ab-a931826314b5
Welsh, A.H.
27640871-afff-4d45-a191-8a72abee4c1a
December 2005
Mansson, R.A., Frey, J.G., Essex, J.W. and Welsh, A.H.
(2005)
Prediction of properties from simulations: a re-examination with modern statistical methods.
Journal of Chemical Information and Modeling, 45 (6), 1791-1803.
(doi:10.1021/ci050056i).
Abstract
We discuss models fitted to data collected by Duffy and Jorgensen to predict solvation free energies and partition equilibria of drugs, organic molecules, aromatic heterocycles, and other molecules. These data were originally examined using linear regression, but here more recently developed statistical models are applied. The data set is complicated by the presence of discrepant observations and by curvature in the response. In some cases it is possible to discard a small number of the observations to get a good fit to the data, but in others, discarding an increasing proportion of the observations does not improve the fit. Our general preference is to use robust parameter estimation, which downweights discrepant observations to reduce their influence on the fitted models. Models are selected for four responses using linear or more complicated representations of the explanatory variables, such as cubic polynomials, B-splines, or smoothers via generalized additive models (GAMs). Variables are chosen using the traditional approach of formal tests to assess their contribution to the fit of a model, and resampling methods, including the bootstrap, are also considered to assess the prediction error for given models. Results of our analysis indicate that GAMs are an improvement on linear models for describing the data and making predictions. In general, robust regression models and GAMs have the smallest conditional expected loss of prediction over the four responses. In addition, robust regression models offer the advantage of identifying molecules that perform poorly in the fit. Overall, models were identified that yielded an improvement of approximately 50% in the conditional expected loss of prediction compared with the original parametrization of Duffy and Jorgensen. It was also found that the use of cross-validation to compare models was unreliable, and bootstrapping is preferred.
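The two ideas the abstract leans on most, downweighting discrepant observations and using the bootstrap to assess prediction error, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the data are synthetic, the robust fit is a Huber M-estimate computed by iteratively reweighted least squares, the loss is absolute error on out-of-bag observations, and the function names (`fit_ols`, `fit_huber`, `oob_bootstrap_error`) are hypothetical. It does not cover GAMs or the authors' actual model selection.

```python
import numpy as np


def fit_ols(X, y):
    """Ordinary least squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def fit_huber(X, y, c=1.345, n_iter=30):
    """Huber M-estimate via iteratively reweighted least squares (IRLS)."""
    beta = fit_ols(X, y)  # start from the non-robust fit
    for _ in range(n_iter):
        resid = y - X @ beta
        # Robust residual scale: median absolute deviation, normal-consistent.
        scale = np.median(np.abs(resid - np.median(resid))) / 0.6745
        scale = max(scale, 1e-12)
        u = np.abs(resid) / scale
        # Huber weights: 1 for small residuals, c/u (< 1) for large ones.
        w = np.where(u <= c, 1.0, c / np.maximum(u, 1e-12))
        sw = np.sqrt(w)
        # Weighted least squares step.
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return beta


def oob_bootstrap_error(fit, X, y, n_boot=200, seed=0):
    """Mean absolute prediction error on out-of-bag points, averaged
    over bootstrap resamples (an illustrative choice of loss)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    errors = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample with replacement
        oob = np.setdiff1d(np.arange(n), idx)     # points left out of the resample
        if oob.size == 0:
            continue
        beta = fit(X[idx], y[idx])
        errors.append(np.mean(np.abs(y[oob] - X[oob] @ beta)))
    return float(np.mean(errors))


# Synthetic data (assumption): a straight line plus a few discrepant points.
rng = np.random.default_rng(42)
n = 100
x = rng.uniform(-2, 2, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=n)
y[:8] += 15.0  # discrepant observations

beta_ols = fit_ols(X, y)
beta_huber = fit_huber(X, y)
err_ols = oob_bootstrap_error(fit_ols, X, y)
err_huber = oob_bootstrap_error(fit_huber, X, y)

print("OLS coefficients:  ", beta_ols)
print("Huber coefficients:", beta_huber)
print("Bootstrap error (OLS, Huber):", err_ols, err_huber)
```

In this sketch both fits are scored with the same resampling seed, so the comparison is made on identical bootstrap resamples. The final IRLS weights can also be inspected to flag observations that the robust fit treats as discrepant.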
This record has no associated files available for download.
More information
Published date: December 2005
Organisations:
Statistics
Identifiers
Local EPrints ID: 15866
URI: http://eprints.soton.ac.uk/id/eprint/15866
ISSN: 1549-9596
PURE UUID: 601455f1-18d9-4c99-acfa-2da25b529947
Catalogue record
Date deposited: 22 Aug 2006
Last modified: 16 Mar 2024 02:45
Contributors
Author:
R.A. Mansson
Author:
J.G. Frey
Author:
J.W. Essex
Author:
A.H. Welsh