The University of Southampton
University of Southampton Institutional Repository

Crystalline Cheminformatics - Big Data Approaches to Crystal Engineering

University of Southampton
Adler, Philip, David Felix
40038eb4-9456-4c22-b3ad-a7bbfdcb3680
Coles, Simon
3116f58b-c30c-48cf-bdd5-397d1c1fecf8

Adler, Philip, David Felix (2015) Crystalline Cheminformatics - Big Data Approaches to Crystal Engineering. University of Southampton, Doctoral Thesis, 299pp.

Record type: Thesis (Doctoral)

Abstract

Statistical approaches to chemistry, under the umbrella of cheminformatics, are now widespread, in particular as part of quantitative structure-activity relationship (QSAR) and quantitative structure-property relationship (QSPR) studies on candidate pharmaceuticals. Applying such approaches to legacy data has widely been termed “taking a big data approach”, and finds ready application in cohort medicinal studies and psychological studies.

Crystallography is a field ripe for these approaches, owing in no small part to its history as a field which, by necessity, adopted digital technologies relatively early on as part of X-ray crystallographic techniques. A discussion of the historical background of crystallography, crystallographic engineering and the pertinent areas of cheminformatics, including programming, databases, file formats and statistics, is given as background to the presented research.

Presented here is a series of applications of Big Data techniques within the field of crystallography. Firstly, a naïve attempt at descriptor selection was made using a family of sulphonamide crystal structures and glycine crystal structures. This proved unsuccessful owing to the very large number of available descriptors and the very small number of true glycine polymorphs used in the experiment.

Secondly, an attempt to combine machine learning model building with feature selection was made by applying partition modelling to co-crystal structures obtained from the Cambridge Structural Database. This method established sensible sets of descriptors which would act as strong predictors for the formation of co-crystals; however, validation of the models by using them to make predictions demonstrated their poor predictive power, and led to the uncovering of a number of weaknesses therein.

Thirdly, a homologous series of fluorobenzeneanilides was used as a test bed for a novel, invariant topological descriptor. The descriptor is based on graph-theoretical techniques, and is derived from the patterns of close contacts within the crystal structure. Fluorobenzeneanilides present an interesting case in this context because of the historical understanding that fluorine is rarely a component of a hydrogen-bonding system. Regardless, the descriptor correlates with the melting point of the fluorobenzeneanilides, with one exception. The reasons for this exception are explored.

In addition, categorisations of the crystal structures made using more traditional “by-eye” techniques were compared with groupings of compounds by shared values of the invariant descriptor. It is demonstrated that the novel descriptor does not simply act as a proxy for the arrangement of the molecules in the crystal lattice: intuitively similar structures have different values for the descriptor, while very different structures can have similar values. This is evidence against the general trend of exploring intermolecular contacts in isolation from other influences over lattice formation. The correlation of the descriptor with melting point in this context suggests that the properties of crystalline material are not only products of their lattice structure.
Also presented as part of all of the case studies is an illustration of some weaknesses of the methodology, and a discussion of how these difficulties can be overcome, both by individual scientists and by necessary alterations to the collective approach to recording crystallographic experiments.

Text
thesis (1)
Restricted to Repository staff only until 8 May 2020.
Available under License University of Southampton Thesis Licence.

More information

Published date: January 2015
Organisations: University of Southampton, Chemistry

Identifiers

Local EPrints ID: 410940
URI: https://eprints.soton.ac.uk/id/eprint/410940
PURE UUID: 3e40bda2-7751-4679-b21b-18eb879efa59
ORCID for Simon Coles: orcid.org/0000-0001-8414-9272

Catalogue record

Date deposited: 12 Jun 2017 16:31
Last modified: 14 Mar 2019 01:48
